I'm trying to get this function to work asynchronously (I have tried asyncio, ThreadPoolExecutor and ProcessPoolExecutor, still with no luck).
It takes around 11 seconds on my PC to complete a batch of 500 items, and there is no difference compared to a plain for loop, so I assume it doesn't work as expected (in parallel).
Here is the function:
from unidecode import unidecode
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)

def is_it_bad(word):
    for item in all_names:
        if str(word) in str(item['name']):
            return item
    item = {'name':word, 'gender': 2}
    return item

def check_word(arr):
    fname = unidecode(str(arr[1]['fullname'] + ' ' + arr[1]['username'])).replace('([^a-z ]+)', ' ').lower()
    fname = fname + ' ' + fname.replace(' ', '')
    fname = fname.split(' ')
    genders = []
    for chunk in fname:
        if len(chunk) > 2:
            genders.append(int(is_it_bad('_' + chunk + '_')['gender']))
    if set(genders) == {2}:
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    elif set([0,1]).issubset(genders):
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    else:
        if 0 in genders:
            followers[arr[0]]['gender'] = 0
            #results_new.append(name)
        else:
            followers[arr[0]]['gender'] = 1
            #results_new.append(name)

results = pool.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Can you please help me with this?
You are using the module "multiprocessing.dummy".
According to the documentation, "multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module."
The threading module does not provide the same speedup as the multiprocessing module for CPU-bound work: in CPython the global interpreter lock lets only one thread execute Python bytecode at a time, so the threads effectively run serially. For more information on how to use the multiprocessing module, visit this tutorial (no affiliation).
In it, the author uses both multiprocessing.dummy and multiprocessing to accomplish two different tasks. You'll notice multiprocessing is the module used to provide the speedup. Just switch to that module and you should see an improvement.
I am unable to run your code because I don't have the unidecode package, but here is how I have used multiprocessing in my previous projects, applied to your code:
import multiprocessing
#get maximum threads
max_threads = multiprocessing.cpu_count()
#max_threads = multiprocessing.cpu_count()-1 #I prefer to use 1 less core if i still wish to use my device
#create pool with max_threads
p = multiprocessing.Pool(max_threads)
#execute pool with function
results = p.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Let me know if this works or helps!
Edit: Added some comments to the code
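One caveat when switching from threads to processes: each worker process has its own copy of module-level objects, so the assignments to followers made inside check_word would not be visible in the parent process. Here is a minimal sketch of one way around that, assuming the worker is refactored to return its result. The helper combine_genders and the pre-computed lists of gender codes are hypothetical stand-ins, not part of the original code; the name lookup itself is left out.

```python
from multiprocessing import Pool


def combine_genders(genders):
    """Collapse per-chunk gender codes into one code, following the branches of check_word."""
    found = set(genders)
    if found == {2} or {0, 1} <= found:
        return 2
    return 0 if 0 in found else 1


def check_word(arr):
    """Return (index, gender) instead of writing to a global dict."""
    idx, genders = arr
    return idx, combine_genders(genders)


if __name__ == "__main__":
    # Hypothetical input: an index plus the gender codes already looked up for each name chunk.
    work = [(0, [2, 2]), (1, [0, 2]), (2, [0, 1])]
    with Pool(2) as pool:
        results = pool.map(check_word, work)
    # The parent process applies the results, so the updates are not lost.
    followers = {idx: {"gender": gender} for idx, gender in results}
    print(followers)
```

The point is only that the data flows back through the return value of map(); how the gender codes are produced stays as in the question.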
I have over a million json files, and I'm trying to find the fastest way to check first, if they load, and then, if there exists either key_A, key_B, or neither. I thought I might be able to use ray to speed up this process, but opening a file seems to fail with ray.
As a simplification, here's my attempt at just checking whether or not a file will load:
import ray
ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.good = 0
        self.bad = 0

    def increment(self, j):
        try:
            with open(j, 'r') as f:
                l = json.load(f)
            self.good += 1
        except: # all files end up here
            self.bad += 1

    def read(self):
        return (self.good, self.bad)

counter = Counter.remote()
[counter.increment.remote(j) for j in json_paths]
futures = counter.read.remote()
print(ray.get(futures))
But I end up with (0, len(json_paths)) as a result.
For reference, the slightly more complicated actual end goal I have is to check:
new, old, bad = 0,0,0
try:
    with open(json_path, 'r') as f:
        l = json.load(f)
    ann = l['frames']['FrameLabel']['annotations']
    first_object = ann[0][0]
except:
    bad += 1
    return
if 'object_category' in first_object:
    new += 1
elif 'category' in first_object:
    old += 1
else:
    bad += 1
I'd recommend not using Python for this at all, but a dedicated tool such as jq.
A command like
jq -c "[input_filename, (.frames.FrameLabel.annotations[0][0]|[.object_category,.category])]" good.json bad.json old.json
outputs
["good.json",["good",null]]
["bad.json",[null,null]]
["old.json",[null,"good"]]
for each of your categories of data, which will be significantly easier to parse.
You can use e.g. the GNU find tool, or if you're feeling fancy, parallel, to come up with the command lines to run.
You could use Python's built-in concurrent.futures module instead to perform your task, which ray might not be best suited for. Example:
from concurrent.futures import ThreadPoolExecutor

numThreads = 10

def checkFile(path):
    return True # parse and check here

with ThreadPoolExecutor(max_workers=numThreads) as pool:
    good = sum(pool.map(checkFile, json_paths))
bad = len(json_paths) - good
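To make the "parse and check here" part concrete, here is a rough sketch of how the three-way check from the question could be plugged into the same thread pool. The names classify and count_categories are made up for illustration, and the key path is taken from the question's snippet; treat it as a starting point rather than a tested solution for a million files.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def classify(path):
    """Return 'new', 'old' or 'bad' for one JSON file."""
    try:
        with open(path, "r") as f:
            data = json.load(f)
        first_object = data["frames"]["FrameLabel"]["annotations"][0][0]
        if "object_category" in first_object:
            return "new"
        if "category" in first_object:
            return "old"
    except (OSError, ValueError, KeyError, IndexError, TypeError):
        # Unreadable file, invalid JSON, or unexpected structure.
        pass
    return "bad"


def count_categories(paths, num_threads=10):
    """Classify every path on a thread pool and tally the labels."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return Counter(pool.map(classify, paths))
```

Calling count_categories(json_paths) returns a Counter, so the three totals are available as counts["new"], counts["old"] and counts["bad"]. Catching specific exceptions instead of a bare except also keeps a programming error such as a missing import from being silently counted as a bad file.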
I have to parse 30 days of access logs from the server, filtered by client IP and accessed host, and need to find the top 10 accessed sites. The log file will be around 10-20 GB in size, so a single-threaded script takes a very long time. Initially, I wrote a script that worked fine but was too slow because of the large log file size. Then I tried to use the multiprocessing library for parallel processing, but it is not working: it seems that every process repeats the same task instead of the work being done in parallel. I'm not sure what is wrong in the code. Can someone please help with this? Thank you in advance for your help.
Code:
from datetime import datetime, timedelta
import commands
import os
import string
import sys
import multiprocessing

def ipauth (slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while (count <=30):
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" %(slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1
    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt','r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return

if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
UPDATE AFTER CONVERSATION WITH OP, PLEASE SEE COMMENTS
My take: Split the file into smaller chunks and use a process pool to work on those chunks:
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from file
    # then yield
    pass

def process(lines, static_ip_list):
    pass # do stuff to a chunk of lines

p = multiprocessing.Pool()
fp = open(slave_list)
for f in chunk_of_lines(fp, 10):
    p.apply_async(process, [f, static_ip_list])
p.close()
p.join() # Wait for all child processes to finish.
There are many ways to implement the chunk_of_lines function: you could iterate over the file lines using a simple for loop, or do something more advanced such as calling fp.read().
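As one possible implementation of chunk_of_lines, here is a small sketch based on itertools.islice. It assumes fp is any iterable of lines, such as an open text file; the io.StringIO object in the demo only stands in for a real log file.

```python
import io
from itertools import islice


def chunk_of_lines(fp, n):
    """Yield lists of up to n lines from the open file fp until it is exhausted."""
    while True:
        chunk = list(islice(fp, n))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    sample = io.StringIO("a\nb\nc\nd\ne\n")  # stands in for an open log file
    for chunk in chunk_of_lines(sample, 2):
        print(chunk)
```

Only one chunk is held in memory at a time on the reading side, which matters for a 10-20 GB log. In practice a chunk size much larger than 10 lines is probably needed so that the cost of sending each chunk to a worker does not dominate.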
imap version:
import os
import multiprocessing as mp
import timeit
import string
import random

PROCESSES = 5
FILE = 'test_imap.txt'

def remove_file():
    try:
        os.remove(FILE)
    except FileNotFoundError:
        pass

def produce(i):
    return [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32)) for i in range(100000)]

def imap_version():
    with mp.Pool(PROCESSES) as p:
        with open(FILE, 'a') as fp:
            for lines in p.imap_unordered(produce, range(5)):
                for line in lines:
                    fp.write(line + '\n')

if __name__ == '__main__':
    remove_file()
    imap_version_result = timeit.repeat("imap_version()", setup="from __main__ import imap_version", repeat=5, number=5)
    print('imap result:', imap_version_result)
apply_async version:
import os
import multiprocessing as mp
import timeit
import string
import random

PROCESSES = 5
FILE = 'test_apply.txt'

def remove_file():
    try:
        os.remove(FILE)
    except FileNotFoundError:
        pass

def produce():
    return [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32)) for i in range(100000)]

def worker():
    lines = produce()
    with open(FILE, 'a') as fp:
        for line in lines:
            fp.write(line + '\n')

def apply_version():
    with mp.Pool(PROCESSES) as p:
        processes = []
        for i in range(5):
            processes.append(p.apply_async(worker))
        while True:
            if all((p.ready() for p in processes)):
                break

if __name__ == '__main__':
    remove_file()
    apply_version_result = timeit.repeat("apply_version()", setup="from __main__ import apply_version", repeat=5, number=5)
    print('apply result', apply_version_result)
Results:
imap result: [62.71130559899029, 62.65627204600605, 62.534730065002805, 62.67373917000077, 62.74415319500258]
apply result [72.03727042900573, 72.17959955699916, 72.2304800950078, 72.02653418600676, 72.11620796499483]
I expected imap to be slower because the child processes need to pickle the results and send them to the main process, which then writes them to the file, whereas each child process in the apply_async version writes its results directly to the file. Instead, imap is faster than apply_async.
Why is this so?
nb: This was done using Python 3.4.3 on Mac OS X 10.11
A quick glance at your source code shows that imap_version() opens your output file once per call, whereas apply_version() opens it once per worker task, which is 5 times per call because of the range(5) loop.
with open(FILE, 'a') as fp is called 125 times in your async version vs 25 times in your imap version.
My guess is the busy loop is the culprit (besides it being an anti-pattern in its own right).
By checking the state yourself, you do redundant work: multiprocessing's machinery does pretty much the same with the work queue behind the scenes (in multiprocessing.pool.Pool._handle_workers() running in a separate thread). On the other hand, IMapIterator.next uses threading.Condition(threading.Lock()) to suspend the main thread's execution until an item is ready (so _handle_workers runs unhindered - remember that only one thread can run Python code at each moment).
Anyway, this is but another guess. The only decisive evidence would be a profiling result.
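If the busy loop is indeed the cost, one way to test the guess would be to block on the results instead of polling ready(). Below is a minimal sketch of that idea; square is a trivial stand-in for the real worker and wait_all is a made-up helper name. Whether it closes the gap would still have to be measured.

```python
import multiprocessing as mp


def square(x):
    # Trivial stand-in for the real worker.
    return x * x


def wait_all(async_results):
    """Block on each AsyncResult in turn; get() waits until that result is ready."""
    return [r.get() for r in async_results]


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        pending = [pool.apply_async(square, (i,)) for i in range(5)]
        # Must happen inside the with block: leaving it terminates the pool.
        print(wait_all(pending))
```

A side benefit of get() is that an exception raised inside a worker is re-raised in the parent, whereas the ready() loop would simply finish without reporting it.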
I want to download 20 CSV files with a combined size of about 5 MB.
Here is the first version of my code:
import os
from bs4 import BeautifulSoup
import urllib.request
import datetime

def get_page(url):
    try:
        return urllib.request.urlopen(url).read()
    except:
        print("[warn] %s" % (url))
        raise

def get_all_links(page):
    soup = BeautifulSoup(page)
    links = []
    for link in soup.find_all('a'):
        url = link.get('href')
        if '.csv' in url:
            return url
    print("[warn] Can't find a link with CSV file!")

def get_csv_file(company):
    link = 'http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices'
    g = link.find('s=')
    name = link[g + 2:g + 6]
    link = link.replace(name, company)
    urllib.request.urlretrieve(get_all_links(get_page(link)), os.path.join('prices', company + '.csv'))
    print("[info][" + company + "] Download is complete!")

if __name__ == "__main__":
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for security in security_list:
        get_csv_file(security)
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
This code downloads the 20 CSV files (about 5 MB in total) in about 1.2 minutes.
Then I tried to use multiprocessing to make the download faster.
Here is version 2:
if __name__ == "__main__":
    import multiprocessing
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for i in range(20):
        p = multiprocessing.Process(target=hP.get_csv_files([index] + security_list), args=(i,))
        p.start()
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
Unfortunately, version 2 takes about 2.4 minutes to download the same 20 CSV files.
Why does multiprocessing slow down my program?
What am I doing wrong?
What is the best way to download these files faster than now?
Thank you!
I don't know what exactly you are trying to start with Process in your example (I think you have a few typos). I think you want something like this:
processes = []
for security in security_list:
    p = multiprocessing.Process(target=get_csv_file, args=(security,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
This way you iterate over the securities, create a new process for each security name and put the process in a list.
After you started all the processes, you loop over them and wait for them to finish, using join.
There is also a simpler way to do this, using Pool and its parallel map implementation.
pool = multiprocessing.Pool(processes=5)
pool.map(get_csv_file, security_list)
You create a Pool of processes (if you omit the argument, it will create a number equal to your processor count), and then you apply your function to each element in the list using map. The pool will take care of the rest.
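Since downloading is I/O-bound rather than CPU-bound, a pool of threads may do just as well here, because threads release the interpreter lock while they wait on the network. multiprocessing.pool.ThreadPool offers the same map() interface backed by threads. A small sketch under that assumption follows; download_all and fake_download are hypothetical names, and fake_download only stands in for get_csv_file so the snippet runs without network access.

```python
from multiprocessing.pool import ThreadPool


def download_all(download, companies, workers=5):
    """Call download(company) for every company on a pool of threads; results keep input order."""
    with ThreadPool(workers) as pool:
        return pool.map(download, companies)


def fake_download(company):
    # Hypothetical stand-in for get_csv_file, so nothing is fetched here.
    return company + ".csv"


if __name__ == "__main__":
    print(download_all(fake_download, ["AAPL", "ADBE", "AMD"]))
```

With the real function this would be download_all(get_csv_file, security_list). Which of the two pools is faster for this site is something to measure rather than assume.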
I am using Python Multiprocessing module to scrape a website. Now this website has over 100,000 pages. What I am trying to do is to put every 500 pages I retrieve into a separate folder. The problem is that though I successfully create a new folder, my script only populates the previous folder. Here is the code:
global a = 1
global b = 500

def fetchAfter(y):
    global a
    global b
    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
    if (os.path.exists( os.path.join( "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\", str(y) + ".html" )) == 0):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1,3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))
        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a,b))
        pool.close()
        pool.join()
        a = b
        b = b + 500
    print time.time()-start
It is best for the worker function to rely only on the single argument it gets to determine what to do, because that is the only information it receives from the parent process each time it is called. This argument can be almost any Python object (including a tuple, dict or list), so you're not really limited in the amount of information you can pass to a worker.
So make a list of 2-tuples. Each 2-tuple should consist of (1) the file to get and (2) the directory where to stash it. Feed that list of tuples to map(), and let it rip.
I'm not sure if it is useful to specify the number of processes you want to use. Pool generally uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)
BTW, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().
Edit: Added example code below.
import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg
    # It might be a good idea to put a random delay of a few seconds here,
    # so we don't hammer the webserver!
    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages...
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'),
                ('http://www.foo.org/page2.html', 'dir1'),
                ('http://www.foo.org/page3.html', 'dir2'),
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)
    # Pool lives in the multiprocessing module imported above.
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
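For the question's layout of 500 pages per folder, the list of 2-tuples could be generated along the following lines. This is only a sketch: build_worklist is a made-up helper, and the pageN.html URL pattern simply follows the example above, so it would have to be adapted to the real site.

```python
import os


def build_worklist(base_url, rootdir, total_pages, pages_per_dir=500):
    """Return (url, directory) 2-tuples, assigning pages_per_dir consecutive pages to each directory."""
    worklist = []
    for page in range(1, total_pages + 1):
        # First and last page number of the folder this page belongs to.
        first = ((page - 1) // pages_per_dir) * pages_per_dir + 1
        last = first + pages_per_dir - 1
        savedir = os.path.join(rootdir, "%d-%d" % (first, last))
        worklist.append(("%s/page%d.html" % (base_url, page), savedir))
    return worklist


if __name__ == "__main__":
    for url, savedir in build_worklist("http://www.foo.org", "out", 3, pages_per_dir=2):
        print(url, savedir)
```

The distinct directories can then be collected with a set over the second tuple element and created once in the parent before map() is called, so the workers never have to agree on which folder is current.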
Multiprocessing, as the name implies, uses separate processes. The processes you create with your Pool do not have access to the original values of a and b that you are adding 500 to in the main program. See this previous question.
The easiest solution is to just refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
Here's one way to implement it:
#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0: # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath) # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)
    pool = mp.Pool(4) # number of processes is unrelated to number of CPUs
                      # because the task is IO-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()
See also "Python multiprocessing pool.map for multiple arguments" and the code "Brute force basic http authorization using httplib and multiprocessing" from "how to make HTTP in Python faster?"