I am planning to run reverse DNS lookups on 47 million IPs. Here is my code:
import socket

ips = []
with open(file, 'r') as f:
with open ('./ip_ptr_new.txt','a') as w:
for l in f:
la = l.rstrip('\n')
ip,countdomain = la.split('|')
ips.append(ip)
try:
ais = socket.gethostbyaddr(ip)
print ("%s|%s|%s" % (ip,ais[0],countdomain), file = w)
except:
print ("%s|%s|%s" % (ip,"None",countdomain), file = w)
Currently it is very slow. Does anybody have any suggestions to speed it up?
Try using the multiprocessing module. I timed the performance on about 8000 IPs and got this:
#dns.py
real 0m2.864s
user 0m0.788s
sys 0m1.216s
#slowdns.py
real 0m17.841s
user 0m0.712s
sys 0m0.772s
# dns.py
from multiprocessing import Pool
import socket

def dns_lookup(item):
    ip, countdomain = item
    try:
        ais = socket.gethostbyaddr(ip)
        return "%s|%s|%s" % (ip, ais[0], countdomain)
    except (socket.herror, socket.gaierror):
        return "%s|%s|%s" % (ip, "None", countdomain)

if __name__ == '__main__':
    filename = "input.txt"
    ips = []
    with open(filename, 'r') as f:
        for l in f:
            la = l.rstrip('\n')
            ip, countdomain = la.split('|')
            ips.append((ip, countdomain))
    # Do the lookups in parallel and write the results to the output file.
    with open('./ip_ptr_new.txt', 'a') as w:
        p = Pool(5)
        for line in p.map(dns_lookup, ips):
            print(line, file=w)
#slowdns.py
import socket
from multiprocessing import Pool
filename = "input.txt"
if __name__ == '__main__':
ips = []
with open(filename,'r') as f:
with open ('./ip_ptr_new.txt','a') as w:
for l in f:
la = l.rstrip('\n')
ip,countdomain = la.split('|')
ips.append(ip)
try:
ais = socket.gethostbyaddr(ip)
print ("%s|%s|%s" % (ip,ais[0],countdomain), file = w)
except:
print ("%s|%s|%s" % (ip,"None",countdomain), file = w)
One solution here is to use the nslookup shell command with its timeout option, or possibly the host command.
An imperfect but useful example:
import subprocess

def sh_dns(ip, dns):
    # Query the given DNS server with nslookup, hard-capped at 0.2 s by coreutils `timeout`.
    a = subprocess.Popen(['timeout', '0.2', 'nslookup', '-norec', ip, dns],
                         stdout=subprocess.PIPE)
    sortie = a.stdout.read()
    # The PTR answer, if any, follows the last '=' in the output.
    tab = str(sortie).split('=')
    if len(tab) > 1:
        return tab[-1].strip(' \\n\'')
    else:
        return ""
We recently had to deal with this problem as well.
Running on multiple processes didn't provide a good enough solution: it could take several days to process a few million IPs on a strong AWS machine.
What worked well was Amazon EMR; it took around half an hour on a 10-machine cluster.
You cannot scale much with a single machine (and usually a single network interface), since this is a network-intensive task. Using MapReduce across multiple machines got the job done.
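For reference, a minimal sketch of what the per-line work can look like as an mrjob job that runs locally or on EMR (the class name and the ip|countdomain input format are assumptions; it is not the exact job we ran):
# mr_rdns.py -- run with: python mr_rdns.py -r emr input.txt
from mrjob.job import MRJob
import socket

class MRReverseDNS(MRJob):
    def mapper(self, _, line):
        # Each input line is assumed to be "ip|countdomain", as in the question.
        ip, countdomain = line.rstrip('\n').split('|')
        try:
            ptr = socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            ptr = "None"
        yield ip, "%s|%s" % (ptr, countdomain)

if __name__ == '__main__':
    MRReverseDNS.run()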
Related
I have a script that reads data via the serial port from a development board. I want to have this script upload the data to a MongoDB collection at the end of each loop, but I don't want the loop to block because of the upload. When I try to use the multiprocessing library to do so, the loop only uploads a blank document.
import csv
import multiprocessing
import pathlib
import time

import serial
from pymongo import MongoClient

client = MongoClient()
db = client['CompostMonitor-1']
def upload_to_database(data):
# Connect to the collection where the data will be stored
collection = db.RedBoard
# Insert the data into the collection
collection.insert_one(data)
port = '/dev/ttyUSB0'
filename = '~/TestData'
containernumber = 1
baud_rate = 9600
RBSerial = serial.Serial(port, baud_rate, timeout=1)
directoryBase = "{}/{}/Bucket {}/RB".format(filename, time.strftime("%m-%d-%Y"), containernumber)
pathlib.Path(directoryBase).mkdir(parents=True, exist_ok=True)
logFileRB = '{}/RB_Bucket_{}_{}_{}_log.bin'.format(directoryBase, containernumber, time.strftime("%m-%d-%Y"),
time.strftime("%H;%M;%S"))
csvRB = '{}/RB_Bucket_{}_{}_{}.csv'.format(directoryBase, containernumber, time.strftime("%m-%d-%Y"),
time.strftime("%H;%M;%S"))
startup = True
count = 0
bytearray = []
RB_DataList = []
RB_DataDict = {}
header = ['Date/Time',
'SGP TVOC (ppb)',
'BME Humidity (%)',
'BME Pressure (Pa)',
'BME Temp (Deg C)']
startTime = time.time()
p = multiprocessing.Process(target=upload_to_database, args=(RB_DataDict,))
while 1:
RB_DataDict = {'_id': ''}
RB_inbyte = RBSerial.read(size=1)
with open(logFileRB, 'ab') as l:
l.write(RB_inbyte)
bytearray.append(RB_inbyte)
if RB_inbyte == b'\n':
bytearray.pop()
with open(csvRB, 'a', newline = '') as table:
writer = csv.writer(table)
if count == 0:
writer.writerow(header)
RB_DataSplit = ''.join(str(bytearray)).replace(" ", "").replace('b', '').replace("'", '').replace(",", '').\
replace('[', '').replace(']', '').split(';')
RB_DataList.append(time.strftime("%m-%d-%Y %H:%M:%S"))
for i in range(len(RB_DataSplit)):
RB_DataList.append(RB_DataSplit[i])
print(RB_DataList)
writer.writerow(RB_DataList)
RB_DataDict = {'Date_Time': RB_DataList[0], 'TVOC Con': RB_DataList[1], 'BME Humidity': RB_DataList[2],
'BME Pressure': RB_DataList[3], 'BME Temp': RB_DataList[4]}
print(RB_DataDict)
RB_DataList = []
# upload_to_database(RB_DataDict)
if startup:
p.start()
startup = False
bytearray = []
However, if I just call upload_to_database(RB_DataDict) as in the commented line, it works as intended. I thought that starting the process would have it continually upload RB_DataDict to my Mongo database, but it appears that it just runs one time and then stops.
I haven't found any examples of code attempting to use multiprocessing in an infinite loop, so it's hard to compare my code to something that works. How can I change this code so that it uploads RB_DataDict with the multiprocessing object each time the dictionary is populated?
I found a solution to my problem. I don't really understand why this works so well, but it does:
if __name__ == '__main__':
if startup:
p.start()
startup = False
print('Startup == False')
else:
# Close the process instance and start a new one!
p.close()
p = multiprocessing.Process(target= upload_to_database, args = (RB_DataDict,))
p.start()
print('should have uploaded something here')
Just closing the original process on the next loop iteration and starting a new one fixes the issue. I'm not sure whether the if __name__ == '__main__' guard is necessary in my particular case, since this script isn't meant to be imported by anything else, but I followed the lead of the multiprocessing documentation.
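For what it's worth, an alternative that avoids closing and re-creating a process on every iteration is a single long-lived worker fed through a multiprocessing.Queue; here is a minimal sketch, assuming the same database and collection names as above:
import multiprocessing
from pymongo import MongoClient

def upload_worker(q):
    # The worker owns its own MongoClient; a client should not be shared across processes.
    collection = MongoClient()['CompostMonitor-1'].RedBoard
    while True:
        data = q.get()
        if data is None:        # sentinel: shut the worker down
            break
        collection.insert_one(data)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=upload_worker, args=(q,), daemon=True)
    p.start()
    # Inside the read loop, replace the close()/restart dance with:
    #     q.put(RB_DataDict)
    # and at shutdown:
    #     q.put(None)
    #     p.join()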
I have the following code:
#!/usr/bin/env python
# coding=utf-8
import threading
import requests
import Queue
import sys
import re
#ip to num
def ip2num(ip):
ip = [int(x) for x in ip.split('.')]
return ip[0] << 24 | ip[1] << 16 | ip[2] << 8 | ip[3]
#num to ip
def num2ip(num):
return '%s.%s.%s.%s' % ((num & 0xff000000) >> 24,(num & 0x00ff0000) >> 16,(num & 0x0000ff00) >> 8,num & 0x000000ff)
def ip_range(start, end):
return [num2ip(num) for num in range(ip2num(start), ip2num(end) + 1) if num & 0xff]
def bThread(iplist):
threadl = []
queue = Queue.Queue()
for host in iplist:
queue.put(host)
for x in xrange(0, int(SETTHREAD)):
threadl.append(tThread(queue))
for t in threadl:
t.start()
for t in threadl:
t.join()
#create thread
class tThread(threading.Thread):
def __init__(self, queue):
threading.Thread.__init__(self)
self.queue = queue
def run(self):
while not self.queue.empty():
host = self.queue.get()
try:
checkServer(host)
except:
continue
def checkServer(host):
ports = [80]
for k in ports:
try:
aimurl = "http://"+host+":"+str(k)
response = requests.get(url=aimurl,timeout=3)
serverText = response.headers['server']
if (response.status_code) == 403:
print "-"*50+"\n"+aimurl +" Server: "+serverText
except:
pass
if __name__ == '__main__':
print '\n############# CDN IP #############'
print ' '
print '################################################\n'
global SETTHREAD
try:
SETTHREAD = sys.argv[2]
iplist = []
file = open(sys.argv[1], 'r')
tmpIpList = file.readlines()
for ip in tmpIpList:
iplist.append(ip.rstrip("\n"))
print '\nEscaneando '+str(len(iplist))+" IP's...\n"
bThread(iplist)
except KeyboardInterrupt:
print 'Keyboard Interrupt!'
sys.exit()
The script works as follows: a range of IPs is entered, for example:
python2 script.py 104.0.0.0-104.0.1.255 100 (100 is the number of threads)
I want to add support so that it can also read the IPs from a file, while the range input keeps working:
python2 script.py ips.txt 100
I tried this:
file = open(sys.argv[1], 'r')
iplist = file.readlines()
But it does not work.
Edit 1: I added the file-reading code recommended by user Syed Hasan; the problem now seems to be in the bThread(iplist) function.
I assume you're attempting to use iplist the same way your CLI input was being parsed. However, readlines simply reads the entire file at once and keeps the trailing newline (\n) on each line (provided the IPs in the file are separated by newlines).
So currently you're getting a list of IPs, each with a trailing newline character. Try removing it with rstrip:
file = open(sys.argv[1], 'r')
tmpIpList = file.readlines()
for ip in tmpIpList:
iplist.append(ip.rstrip("\n"))
How you switch between the two modes is a problem you'll still have to solve; perhaps use command-line parameters to identify the mode of operation (look into the argparse library), as sketched below.
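For example, a small helper could pick the mode by checking whether the first argument is an existing file, falling back to the range parser otherwise (just a sketch; it reuses the ip_range() function from your script):
import os

def load_targets(arg):
    # File mode: one IP per line.
    if os.path.isfile(arg):
        with open(arg) as f:
            return [line.strip() for line in f if line.strip()]
    # Range mode: "start-end", handled by the existing ip_range() helper.
    start, end = arg.split('-')
    return ip_range(start, end)

# usage:  iplist = load_targets(sys.argv[1]); SETTHREAD = sys.argv[2]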
I am learning Python at my high school. We are still learning.
We wrote this small script that loads a website using cookies and downloads files of about 400 KB each. But for some reason it is very slow, even though our internet connection is very fast. It should be able to download 20-30 files at once, yet it downloads just one file at a time and still waits a few seconds before starting the next. Why is that? Please check the script and give suggestions. No matter where we run the script, it downloads at most 7-8 files per minute, which can't be right.
Check it out:
import urllib.request,string,random
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))
ab = 0
faturanum = 20184009433300
while (ab != 1000000):
try:
ab = ab + 1
opener = urllib.request.build_opener()
a = id_generator() + ".pdf"
faturanum = faturanum - 1
fatura = str(faturanum)
faturanome = fatura + ".pdf"
opener.addheaders = [('Cookie', 'ASP.NET_SessionId=gpkufgrfzsk5abc0bc2v2v3e')]
f = opener.open("https://urltest.com/Fatura/Pdf?nrFatura="+fatura)
file = open(faturanome, 'wb')
file.write(f.read())
file.close()
print("file downloaded:"+str(ab)+" downloaded!")
except:
pass
Why is this so slow? The remote server is very fast too. Is there some way to get better results, maybe by putting several files in a queue? Like I said, we are still learning. We just want a way for the script to make several requests at once and fetch several files at a time, instead of one by one.
So this is what I wrote after hours, but it doesn't work.
Well, here we go.
What am I doing wrong? I am doing my best, but I have never used threading before. Hey @KlausD, check it and let me know what I am doing wrong. It is a website that requires cookies, and it needs to load the URL and turn it into a PDF.
Here is my attempt:
import os
import threading
import urllib.request,string,random
from queue import Queue
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))
class Downloader(threading.Thread):
def __init__(self, queue):
threading.Thread.__init__(self)
self.queue = queue
def run(self):
while True:
url = self.queue.get()
self.download_file(url)
self.queue.task_done()
def download_file(self, url):
faturanum = 20184009433300
ab = ab + 1
handle = urllib.request.urlopen(url)
a = id_generator() + ".pdf"
faturanum = faturanum - 1
fatura = str(faturanum)
faturanome = fatura + ".pdf"
handle.addheaders = [('Cookie', 'ASP.NET_SessionId=v5x4k1m44saix1f5hpybf2qu')]
fname = os.path.basename(url)
with open(fname, "wb") as f:
while True:
chunk = handle.read(1024)
if not chunk: break
f.write(chunk)
def main(urls):
queue = Queue()
for i in range(5):
t = Downloader(queue)
t.setDaemon(True)
t.start()
for url in urls:
queue.put(url)
queue.join()
if __name__ == "__main__":
urls = ["https://sitetest.com/Fatura/Pdf?nrFatura="+fatura"]
main(urls)
What is wrong? This is my best effort; believe me, I am losing sleep trying to make it work.
I have a cluster of computers which uses a master node to communicate with the slave nodes in the cluster.
The main problem I'm facing with execnet is being able to kill certain running jobs and then have new jobs requeued on the same cores the killed jobs were running on (I want to utilize all cores of the slave nodes at any given time).
As of now there is no way to terminate a running job through execnet, so I figured that if I could just kill the jobs manually through a bash script, say sudo kill 12345 where 12345 is the PID of the job (obtaining the PID of each job is another thing not supported by execnet, but that's another topic), it would terminate the job and I could then requeue another one on the core that was just freed. It does kill the job correctly; however, it also closes the connection to that channel (the core; the master node communicates with each core individually) and then never uses that core again until all jobs are done. Is there a way to terminate a running job without killing the connection to the core?
Here is the script that submits the jobs:
import execnet, os, sys
import re
import socket
import numpy as np
import pickle, cPickle
from copy import deepcopy
import time
import job
def main():
print 'execnet source files are located at:\n {}/\n'.format(
os.path.join(os.path.dirname(execnet.__file__))
)
# Generate a group of gateways.
work_dir = '/home/mpiuser/pn2/'
f = 'cluster_core_info.txt'
n_start, n_end = 250000, 250008
ci = get_cluster_info(f)
group, g_labels = make_gateway_group(ci, work_dir)
    mch = group.remote_exec(job)
    # Collect replies from all channels in a single queue; the endmarker item is
    # delivered when a channel closes.
    queue = mch.make_receive_queue(endmarker='terminate_channel')
    args = range(n_start, n_end+1) # List of parameters to compute factorial.
    manage_jobs(group, mch, queue, g_labels, args)
# Close the group of gateways.
group.terminate()
def get_cluster_info(f):
nodes, ncores = [], []
with open(f, 'r') as fid:
while True:
line = fid.readline()
if not line:
fid.close()
break
line = line.strip('\n').split()
nodes.append(line[0])
ncores.append(int(line[1]))
return dict( zip(nodes, ncores) )
def make_gateway_group(cluster_info, work_dir):
''' Generate gateways on all cores in remote nodes. '''
print 'Gateways generated:\n'
group = execnet.Group()
g_labels = []
nodes = list(cluster_info.keys())
for node in nodes:
for i in range(cluster_info[node]):
group.makegateway(
"ssh={0}//id={0}_{1}//chdir={2}".format(
node, i, work_dir
))
sys.stdout.write(' ')
sys.stdout.flush()
print list(group)[-1]
# Generate a string 'node-id_core-id'.
g_labels.append('{}_{}'.format(re.findall(r'\d+',node)[0], i))
print ''
return group, g_labels
def get_mch_id(g_labels, string):
ids = [x for x in re.findall(r'\d+', string)]
ids = '{}_{}'.format(*ids)
return g_labels.index(ids)
def manage_jobs(group, mch, queue, g_labels, args):
args_ref = deepcopy(args)
terminated_channels = 0
active_jobs, active_args = [], []
while True:
channel, item = queue.get()
if item == 'terminate_channel':
terminated_channels += 1
print " Gateway closed: {}".format(channel.gateway.id)
if terminated_channels == len(mch):
print "\nAll jobs done.\n"
break
continue
if item != "ready":
mch_id_completed = get_mch_id(g_labels, channel.gateway.id)
depopulate_list(active_jobs, mch_id_completed, active_args)
print " Gateway {} channel id {} returned:".format(
channel.gateway.id, mch_id_completed)
print " {}".format(item)
if not args:
print "\nNo more jobs to submit, sending termination request...\n"
mch.send_each(None)
args = 'terminate_channel'
if args and \
args != 'terminate_channel':
arg = args.pop(0)
idx = args_ref.index(arg)
channel.send(arg) # arg is copied by value to the remote side of
# channel to be executed. Maybe blocked if the
# sender queue is full.
# Get the id of current channel used to submit a job,
# this id can be used to refer mch[id] to terminate a job later.
mch_id_active = get_mch_id(g_labels, channel.gateway.id)
print "Job {}: {}! submitted to gateway {}, channel id {}".format(
idx, arg, channel.gateway.id, mch_id_active)
populate_list(active_jobs, mch_id_active,
active_args, arg)
def populate_list(jobs, job_active, args, arg_active):
jobs.append(job_active)
args.append(arg_active)
def depopulate_list(jobs, job_completed, args):
i = jobs.index(job_completed)
jobs.pop(i)
args.pop(i)
if __name__ == '__main__':
main()
and here is my job.py script:
#!/usr/bin/env python
import os, sys
import socket
import time
import numpy as np
import pickle, cPickle
import random
import job
def hostname():
return socket.gethostname()
def working_dir():
return os.getcwd()
def listdir(path):
return os.listdir(path)
def fac(arg):
return np.math.factorial(arg)
def dump(arg):
path = working_dir() + '/out'
if not os.path.exists(path):
os.mkdir(path)
f_path = path + '/fac_{}.txt'.format(arg)
t_0 = time.time()
num = fac(arg) # Main operation
t_1 = time.time()
cPickle.dump(num, open(f_path, "w"), protocol=2) # Main operation
t_2 = time.time()
duration_0 = "{:.4f}".format(t_1 - t_0)
duration_1 = "{:.4f}".format(t_2 - t_1)
#num2 = cPickle.load(open(f_path, "rb"))
return '--Calculation: {} s, dumping: {} s'.format(
duration_0, duration_1)
if __name__ == '__channelexec__':
channel.send("ready")
for arg in channel:
if arg is None:
break
elif str(arg).isdigit():
channel.send((
str(arg)+'!',
job.hostname(),
job.dump(arg)
))
else:
            print 'Warning! arg sent should be a number or None'
Yes, you are on the right track. Use the psutil library to manage the processes, find their PIDs, and so on, and kill them. There is no need to involve bash anywhere; Python covers it all.
Or, even better, program your script to terminate when the master says so.
That is how it is usually done.
You can even make it start another script before terminating itself if you want or need to.
Or, if the new job is the same kind of work you would be doing in another process, just stop the current work and start the new one within the same script, without terminating it at all.
And, if I may make a suggestion: don't read your file line by line; read the whole file and then use *.splitlines(). For small files, reading them in chunks just tortures the I/O. You wouldn't need *.strip() either. And you should remove the unused imports too.
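For the psutil route, a minimal sketch (the command-line pattern used to find the workers is an assumption; adapt it to whatever your job processes actually run):
import psutil

def kill_jobs(pattern):
    # Terminate every process whose command line contains `pattern`,
    # e.g. the name of the job script started on the slave node.
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if pattern in cmdline:
            proc.terminate()    # sends SIGTERM; use proc.kill() for SIGKILL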
I am comparing scapy and dpkt in terms of speed. I have a directory with pcap files which I parse, counting the HTTP requests in each file. Here's the scapy code:
import time
from scapy.all import *
def parse(f):
x = 0
pcap = rdpcap(f)
for p in pcap:
try:
if p.haslayer(TCP) and p.getlayer(TCP).dport == 80 and p.haslayer(Raw):
x = x + 1
except:
continue
print x
if __name__ == '__main__':
path = '/home/pcaps'
start = time.time()
for file in os.listdir(path):
current = os.path.join(path, file)
print current
f = open(current)
parse(f)
f.close()
end = time.time()
print (end - start)
The script is really slow (it gets stuck after a few minutes) compared to the dpkt version:
import dpkt
import time
from os import walk
import os
import sys
def parse(f):
x = 0
try:
pcap = dpkt.pcap.Reader(f)
except:
print "Invalid Header"
return
for ts, buf in pcap:
try:
eth = dpkt.ethernet.Ethernet(buf)
except:
continue
if eth.type != 2048:
continue
try:
ip = eth.data
except:
continue
if ip.p == 6:
if type(eth.data) == dpkt.ip.IP:
tcp = ip.data
if tcp.dport == 80:
try:
http = dpkt.http.Request(tcp.data)
x = x+1
except:
continue
print x
if __name__ == '__main__':
path = '/home/pcaps'
start = time.time()
for file in os.listdir(path):
current = os.path.join(path, file)
print current
f = open(current)
parse(f)
f.close()
end = time.time()
print (end - start)
So is there something wrong with the way I am using scapy, or is it just that scapy is slower than dpkt?
You inspired me to compare. 2 GB PCAP. Dumb test: simply counting the number of packets.
I'd expect this to take single-digit minutes with C++/libpcap, just based on previous timings of similarly sized files. But this is something new; I wanted to prototype first, and my velocity is generally higher in Python.
For my application, streaming is the only option: I'll be reading several of these PCAPs simultaneously and doing computations based on their contents, so I can't just hold them in memory. That's why I'm only comparing streaming calls.
scapy 2.4.5:
from scapy.all import *
import datetime
i=0
print(datetime.datetime.now())
for packet in PcapReader("/my.pcap"):
i+=1
else:
print(i)
print(datetime.datetime.now())
dpkt 1.9.7.2:
import datetime
import dpkt
print(datetime.datetime.now())
with open("/my.pcap", 'rb') as f:
pcap = dpkt.pcap.Reader(f)
i=0
for timestamp, buf in pcap:
i+=1
else:
print(i)
print(datetime.datetime.now())
Results:
Packet count is the same. So that's good. :-)
dpkt - Just under 10 minutes.
scapy - 35 minutes.
dpkt went first, so if the disk cache were helping either package, it would be scapy. And I think it might be, marginally: I did this previously with scapy only, and it took over 40 minutes.
In summary, thanks for your 5-year-old question; it's still relevant today. I almost bailed on Python here because of scapy's overly long read times. dpkt seems substantially more performant.
Side note, alternative packages:
https://pypi.org/project/python-libpcap/ I'm on Python 3.10 and 0.4.0 seems broken for me, unfortunately.
https://pypi.org/project/libpcap/ I'd like to compare timings to this, but have found it much harder to get a minimal example going. Haven't spent much time though, to be fair.