I have this Python program, and now I want to do multiprocessing or multithreading for it. Please help me achieve this.
import os, sys, codecs, random, time, subprocess

years = ["2016", "2017", "2018", "2019", "2020"]

rf = open('URL.txt', 'r')
lines = rf.readlines()
rf.close()

list = []
for element in lines:
    list.append(element.strip())

files = ["myfile1.txt", "myfile2.txt"]

for url in list:
    for year in years:
        for file in files:
            os.system('python myfile.py -u ' + url + ' -y ' + year + ' -f ' + file)
            time.sleep(5)
I want to finish one url in one process or one thread.
You would add:
from multiprocessing import Pool
You would separate your work into a function:
def myfunc(url, year, file):
    os.system('python myfile.py -u ' + url + ' -y ' + year + ' -f ' + file)
And then in place of the loop, you would make a list of argument tuples and send it to a pool using starmap:
pool = Pool(4) # <== number of processes to run in parallel
args = [(url, year, file) for url in lst for year in years for file in files]
pool.starmap(myfunc, args)
(Here I changed list to lst -- please also change the lines in your code that use list to lst instead, because list is a builtin.)
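Assembled end to end, a minimal sketch of that first approach could look like this (swapping os.system for subprocess.run with an argument list, and adding the usual if __name__ == '__main__' guard -- both optional changes on my part):

from multiprocessing import Pool
import subprocess

years = ["2016", "2017", "2018", "2019", "2020"]
files = ["myfile1.txt", "myfile2.txt"]

def myfunc(url, year, file):
    # one invocation of myfile.py; an argument list avoids shell quoting issues
    subprocess.run(['python', 'myfile.py', '-u', url, '-y', year, '-f', file])

if __name__ == '__main__':
    with open('URL.txt') as rf:
        lst = [line.strip() for line in rf]

    args = [(url, year, file) for url in lst for year in years for file in files]
    with Pool(4) as pool:   # 4 processes in parallel, as above
        pool.starmap(myfunc, args)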
Update - just noticed "I want to finish one url in one process or one thread."
You can do a more coarse-grained division by putting some of the looping into the payload function:
def myfunc(url):
    for year in years:
        for file in files:
            os.system('python myfile.py -u ' + url + ' -y ' + year + ' -f ' + file)
and then call it with just the URL. Since it takes only one argument, you don't need starmap any more; plain map with the list of URLs should work:
pool.map(myfunc, lst)
However, there is not much reason to divide it up in this way if the years and files can be done independently in parallel, because the coarse-grained division might mean that the job takes longer to complete (some processes are idle at the end while one is still working on a URL that is slow for some reason). I would still suggest the first approach.
Related
I am trying to download zip files, each containing a CSV file.
On one hand, I have more than 3,000 URLs and therefore 3,000 files. The code I used took between 1 and 2 hours. The total size of the zip files is 40 GB; unzipped, 230 GB.
On the other hand, there is also another set of URLs in the 100Ks. Looking at how long it took to process the previous number of URLs, is there something I can do to improve this code?
Should I make it all in one function?
I have the possibility to run this on a Spark cluster.
# URLs are in a list called links
# Filepaths are a list from ls on the folder of the raw zip files
import requests
import zipfile

def download_zip_files(x, base_url, filepath):
    r = requests.get(x)
    status_code = r.status_code
    filepath = str(x.replace(base_url, filepath))
    with open(filepath, "wb") as file:
        file.write(r.content)

def extract_zip_files(x, basepath, exportpath):
    path = str(basepath + x[1])
    with zipfile.ZipFile(path, "r") as zip_ref:
        zip_ref.extractall(exportpath)

list(map(lambda x: download_zip_files(x, base_url, filepath), links))
list(map(lambda x: extract_zip_files(x, basepath, exportpath), raw_zip_filepaths))
What you need is multithreading. A quick Google definition is as follows:
Multithreading is a CPU (central processing unit) feature that allows two or more instruction threads to execute independently while sharing the same processor resources. A thread is a self-contained sequence of instructions that can execute in parallel with other threads that are part of the same root process.
Most of the time, the programs we write execute in a single-threaded manner (the same as yours): each instruction executes in sequence, which is slow in a case like yours. To run a program in parallel across multiple threads, see the following examples.
Single thread
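A minimal single-threaded sketch, assuming a print_words() helper that just prints a word a few times:

import time

def print_words(word, count):
    # print the word `count` times, pausing briefly between prints
    for _ in range(count):
        print(word)
        time.sleep(0.5)

# single-threaded: the second call only starts after the first one finishes
print_words("hello", 3)
print_words("world", 3)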
Multi-thread approach
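And a minimal sketch of the same print_words() run from two threads:

import threading
import time

def print_words(word, count):
    # print the word `count` times, pausing briefly between prints
    for _ in range(count):
        print(word)
        time.sleep(0.5)

# multi-threaded: both calls run at the same time, so the output interleaves
t1 = threading.Thread(target=print_words, args=("hello", 3))
t2 = threading.Thread(target=print_words, args=("world", 3))
t1.start()
t2.start()
t1.join()
t2.join()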
Please compare the output of the single-threaded and multi-threaded examples.
In the multi-threaded example, the program executes in parallel through threads. In simple words, the print_words() function executes twice in parallel (at the same time) with different parameters.
Let's come to your example:
You can divide your URL list into multiple URL lists and give each thread one of those lists. See the following example, which is just pseudo-code; you should adapt it yourself.
import threading
import requests

# Please divide the following list using any function; I'm giving a simple example, so here it is split by hand.
url_list = ['url1', 'url2', 'url3', 'url4']
# The list above divided into two lists:
url_list_1 = ['url1', 'url2']
url_list_2 = ['url3', 'url4']

def download_zip_files(x, base_url, filepath):
    r = requests.get(x)
    status_code = r.status_code
    filepath = str(x.replace(base_url, filepath))
    with open(filepath, "wb") as file:
        file.write(r.content)

def start_loop_download_zip(links):
    # base_url and filepath come from your existing code
    list(map(lambda x: download_zip_files(x, base_url, filepath), links))
t1 = threading.Thread(target=start_loop_download_zip,args=(url_list_1,))
t2 = threading.Thread(target=start_loop_download_zip,args=(url_list_2,))
# starting thread 1
t1.start()
# starting thread 2
t2.start()
# wait until thread 1 is completely executed
t1.join()
# wait until thread 2 is completely executed
t2.join()
Because the downloads are I/O-bound, this approach should significantly reduce your processing time.
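If you'd rather not split the list by hand, a minimal sketch of the same idea with concurrent.futures.ThreadPoolExecutor (the max_workers value is an assumption):

from concurrent.futures import ThreadPoolExecutor

# assumes download_zip_files, base_url and filepath are defined as above
with ThreadPoolExecutor(max_workers=8) as executor:   # 8 workers is an assumption
    list(executor.map(lambda x: download_zip_files(x, base_url, filepath), url_list))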
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in that relating to multiprocessing was cobbled together from a collection of google searches and articles, so I wouldn't be surprised if it wasn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send its result back to the main process, which has to wait for the result and collect it.
It can be simpler with Pool: it doesn't need a manually managed queue, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't block the CPU for other processes on the system.
But this simple version doesn't give you a tqdm progress bar.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
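If you do want a progress bar back, one option is to wrap Pool.imap in tqdm; a minimal sketch, assuming the same asyncJSONs and all_jsons as above:

from multiprocessing import Pool
from tqdm import tqdm

if __name__ == '__main__':
    with Pool(5) as p:
        # imap yields results one at a time, so tqdm can track progress;
        # total= is needed because imap returns an iterator with no length
        materials = list(tqdm(p.imap(asyncJSONs, all_jsons), total=len(all_jsons)))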
BTW:
Other modules: concurrent.futures, joblib, ray.
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That's just where, instead of the objects being part of a JSON array [{},{}...], you have a separate object on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is less than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and corrupt each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
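If you do end up needing one regular JSON array at the end, a minimal pure-Python sketch of the join (the part-*.jsonl file names are an assumption):

import glob
import json

records = []
for path in glob.glob("part-*.jsonl"):     # hypothetical per-worker output files
    with open(path) as fp:
        records.extend(json.loads(line) for line in fp if line.strip())

with open("combined.json", "w") as fp:     # hypothetical combined output
    json.dump(records, fp)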
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put the results in the queue, while your main code waits for them to show up in the queue.
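A minimal sketch of that queue pattern (the worker body is just a placeholder):

import multiprocessing as mp

def worker(filename, queue):
    properties = {"name": filename.split('.')[0]}   # placeholder for the real parsing
    queue.put(properties)                           # send the result back to the main process

if __name__ == "__main__":
    queue = mp.Queue()
    files = ["a.json", "b.json"]                    # illustrative file names
    procs = [mp.Process(target=worker, args=(f, queue)) for f in files]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]          # collect one result per worker
    for p in procs:
        p.join()
    print(results)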
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I want to run several Python scripts at the same time using concurrent.futures.
The serial version of my code goes and looks for a specific Python file in a folder and executes it.
import re
import os
from glob import glob
import concurrent.futures as cf
import time
import subprocess

FileList = []
start_dir = os.getcwd()
pattern = "Read.py"

for dir, _, _ in os.walk(start_dir):
    FileList.extend(glob(os.path.join(dir, pattern)))

i = 0
for file in FileList:
    dir = os.path.dirname(file)
    dirname1 = os.path.basename(dir)
    print(dirname1)
    i = i + 1
    Str = 'python ' + file
    print(Str)
    completed_process = subprocess.run(Str)
For the parallel version of my code:
def Python_callback(future):
    print(future.run_type, future.jid)
    return "One Folder finished executing"

def Python_execute():
    from concurrent.futures import ProcessPoolExecutor as Pool
    args = FileList
    pool = Pool(max_workers=1)
    future = pool.submit(subprocess.call, args, shell=1)
    future.run_type = "run_type"
    future.jid = FileList
    future.add_done_callback(Python_callback)
    print("Python executed")

if __name__ == '__main__':
    import subprocess
    Python_execute()
The issue is that I am not sure how to pass each element of FileList to a separate CPU.
Thanks for your help in advance
The smallest change is to use submit once for each element, instead of once for the whole list:
futures = []
for file in FileList:
    future = pool.submit(subprocess.call, file, shell=1)
    future.blah blah
    futures.append(future)
The futures list is only necessary if you want to do something with the futures—wait for them to finish, check their return values, etc.
Meanwhile, you're explicitly creating the pool with max_workers=1. Not surprisingly, this means you'll only get 1 worker child process, so it'll end up waiting for one subprocess to finish before grabbing the next one. If you want to actually run them concurrently, remove that max_workers and let it default to one per core (or pass max_workers=8 or some other number that's not 1, if you have a good reason to override the default).
While we're at it, there are a lot of ways to simplify what you're doing:
Do you really need multiprocessing here? If you need to communicate with each subprocess, that can be painful to do in a single thread—but threads, or maybe asyncio, will work just as well as processes here.
More to the point, it doesn't look like you actually do need anything but launch the process and wait for it to finish, and that can be done in simple, synchronous code.
Why are you building a string and using shell=1 instead of just passing a list and not using the shell? Using the shell unnecessarily creates overhead, safety problems, and debugging annoyances.
You really don't need the jid on each future—it's just the list of all of your invocation strings, which can't be useful. What might be more useful is some kind of identifier, or the subprocess return code, or… probably lots of other things, but they're all things that could be done by reading the return value of subprocess.call or a simple wrapper.
You really don't need the callback either. If you just gather all the futures in a list and as_completed it, you can print the results as they show up more simply.
If you do both of the above, you've got nothing left but a pool.submit inside the loop—which means you can replace the entire loop with pool.map (see the sketch below).
You rarely need, or want, to mix os.walk and glob. When you actually have a glob pattern, apply fnmatch over the files list from os.walk. But here, you're just looking for a specific filename in each dir, so really, all you need to filter on is file == 'Read.py'.
You're not using the i in your loop. But if you do need it, it's better to do for i, file in enumerate(FileList): than to do for file in FileList: and manually increment an i.
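Putting several of those points together (pool.map, no shell, filtering on the filename directly), a minimal sketch -- I've kept ProcessPoolExecutor from your code, though a ThreadPoolExecutor would work just as well here:

import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def find_scripts(start_dir, name="Read.py"):
    # walk the tree and collect every file literally named Read.py
    return [os.path.join(d, name)
            for d, _, files in os.walk(start_dir)
            if name in files]

def run_script(path):
    # no shell, just a plain argument list; returns the script's exit code
    return path, subprocess.call(["python", path])

if __name__ == "__main__":
    scripts = find_scripts(os.getcwd())
    with ProcessPoolExecutor() as pool:   # defaults to one worker per core
        for path, returncode in pool.map(run_script, scripts):
            print(path, "finished with exit code", returncode)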
I want to make a command that searches, in parallel, a given number of files for a given word, where...
ppatternsearch [-p n] word {files}
ppatternsearch is the command name
-p is an option that defines the level of parallelization
n is the number of processes/threads that the -p option will create for the word search
word is the word I'll be searching for
files is, as you can imagine, the files I'll be searching through.
I want to do this in 2 ways - one with processes and another with threads. In the end, the parent process/main thread returns the number of lines where it found the word that was being searched.
Thing is, I've developed some code already and I've hit a wall. I have no idea where to go from here.
import argparse, os, sys, time

num_lines_with_pattern = []

def pattern_finder(pattern, file_searched):
    counter = 0
    with open(file_searched, 'r') as ficheiro_being_read:
        for line in ficheiro_being_read:
            if pattern in line:
                print(line)
                counter += 1
    num_lines_with_pattern.append(counter)

parser = argparse.ArgumentParser()
parser.add_argument('-p', type=int, default=1, help='Defines command parallelization.')
args = parser.parse_args()
The next step is to import threading or multiprocessing and launch pattern_finder the appropriate number of times.
You'll probably also want to look into queue.Queue so your results aren't printed jumbled up.
The problem may be I/O bound and therefore introducing multiple threads/processes won't make your hard disk work any faster.
Though it should be easy to check. To run pattern_finder() using a process pool:
#!/usr/bin/env python
from functools import partial
from multiprocessing import Pool, cpu_count

def pattern_finder(pattern, file_searched):
    ...
    return file_searched, number_of_lines_with_pattern

if __name__ == "__main__":
    pool = Pool(n or cpu_count() + 1)
    search = partial(pattern_finder, word)
    for filename, count in pool.imap_unordered(search, files):
        print("Found {count} lines in {filename}".format(**vars()))
I'm processing a list of thousands of domain names from a DNSBL through dig, creating a CSV of URLs and IPs. This is a very time-consuming process that can take several hours. My server's DNSBL updates every fifteen minutes. Is there a way I can increase throughput in my Python script to keep pace with the server's updates?
Edit: the script, as requested.
import re
import subprocess as sp

text = open("domainslist", 'r')
text = text.read()
text = re.split("\n+", text)

file = open('final.csv', 'w')
for url in text:
    try:
        ip = sp.Popen(["dig", "+short", url], stdout=sp.PIPE)
        ip = re.split("\n+", ip.stdout.read())
        file.write(url + "," + ip[0] + "\n")
    except:
        pass
Well, it's probably the name resolution that's taking you so long. If you count that out (i.e., if somehow dig returned very quickly), Python should be able to deal with thousands of entries easily.
That said, you should try a threaded approach. That would (theoretically) resolve several addresses at the same time, instead of sequentially. You could just as well continue to use dig for that, and it should be trivial to modify my example code below for that, but, to make things interesting (and hopefully more pythonic), let's use an existing module for that: dnspython
So, install it with:
sudo pip install -f http://www.dnspython.org/kits/1.8.0/ dnspython
And then try something like the following:
import threading
from dns import resolver

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            result = resolver.query(self.address)[0].to_text()
            self.result_dict[self.address] = result
        except resolver.NXDOMAIN:
            pass

def main():
    infile = open("domainlist", "r")
    intext = infile.readlines()
    threads = []
    results = {}
    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()
    for thread in threads:
        thread.join()
    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.items()))
    outfile.close()

if __name__ == '__main__':
    main()
If that proves to start too many threads at the same time, you could try doing it in batches, or using a queue (see http://www.ibm.com/developerworks/aix/library/au-threadingpython/ for an example)
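For example, a minimal sketch of a bounded version using concurrent.futures, so only a fixed number of lookups run at once -- max_workers=50 is an assumption:

from concurrent.futures import ThreadPoolExecutor
from dns import resolver

def lookup(address):
    # resolve one name; return None for the IP if the domain does not exist
    try:
        return address, resolver.query(address)[0].to_text()
    except resolver.NXDOMAIN:
        return address, None

def main():
    with open("domainlist") as infile:
        addresses = [line.strip() for line in infile if line.strip()]

    with ThreadPoolExecutor(max_workers=50) as executor:   # 50 threads is an assumption
        results = dict(executor.map(lookup, addresses))

    with open('final.csv', 'w') as outfile:
        outfile.write("\n".join("%s,%s" % (address, ip)
                                for address, ip in results.items() if ip))

if __name__ == '__main__':
    main()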
The vast majority of the time here is spent in the external calls to dig, so to improve that speed, you'll need to multithread. This will allow you to run multiple calls to dig at the same time. See for example: Python Subprocess.Popen from a thread. Or, you can use Twisted (http://twistedmatrix.com/trac/).
EDIT: You're correct, much of that was unnecessary.
I'd consider using a pure-Python library to do the DNS queries, rather than delegating to dig, because invoking another process can be relatively time-consuming. (Of course, looking up anything on the internet is also relatively time-consuming, so what gilesc said about multithreading still applies) A Google search for python dns will give you some options to get started with.
In order to keep pace with the server updates, the script must take less than 15 minutes to execute. Does your script take 15 minutes to run? If it doesn't, you're done!
I would investigate caching and diffs from previous runs in order to increase performance.