I have a CSV file with 10,000 rows; each row contains a link, and I want to download some info for each link. Since that's a time-consuming task, I manually split it into 4 Python scripts, each one working on 2,500 rows. After that I open 4 terminals and run each of the scripts.
However, I wonder if there's a more efficient way of doing that. Up to now I have 4 .py scripts that I launch manually. What happens if I have to do the same but with 1,000,000 rows? Should I manually create, for example, 50 scripts, each one downloading the info for its own slice of rows? I hope I managed to explain myself :)
Thanks!
You don't need to do any manual splitting – set up a multiprocessing.Pool() with the number of workers you want to be processing your data, and have a function do your work for each item. A simplified example:
import multiprocessing

# This function is run in a separate process
def do_work(line):
    return f"{line} is {len(line)} characters long. This result brought to you by {multiprocessing.current_process().name}"

def main():
    work_items = [f"{2 ** i}" for i in range(1_000)]  # You'd read these from your file
    with multiprocessing.Pool(4) as pool:
        for result in pool.imap(do_work, work_items, chunksize=20):
            print(result)

if __name__ == "__main__":
    main()
This has (up to) 4 processes working on your data, and for efficiency each worker is handed work items in chunks of 20.
If you don't need the results to be in order, use the faster imap_unordered.
You can take a look at https://docs.python.org/3/library/asyncio-task.html to make the download + processing tasks async.
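As an illustration only, here is a minimal sketch of that approach, assuming the third-party aiohttp library is installed and that the print() call stands in for whatever processing you do with each downloaded page (the URLs and the concurrency limit are placeholders, not part of the original answer):
import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    # Limit how many downloads run at once
    async with semaphore:
        async with session.get(url) as resp:
            return url, await resp.text()

async def main(urls):
    semaphore = asyncio.Semaphore(20)          # at most 20 concurrent downloads
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, u) for u in urls]
        for coro in asyncio.as_completed(tasks):
            url, body = await coro
            print(url, len(body))              # replace with your processing

asyncio.run(main(["https://example.com"]))     # the urls would come from your CSV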
Use threads to run many downloads concurrently within a single interpreter; downloading is I/O-bound, so the GIL is not a bottleneck here (https://realpython.com/intro-to-python-threading)
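A minimal sketch of the threaded version, assuming a hypothetical fetch() helper that downloads one link with the standard library:
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Download one link; network waits release the GIL, so threads overlap nicely
    with urlopen(url) as resp:
        return url, resp.read()

urls = ["https://example.com"]  # read these from your CSV instead
with ThreadPoolExecutor(max_workers=8) as executor:
    for url, body in executor.map(fetch, urls):
        print(url, len(body))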
Related
I used to use pygdrive3 to connect to Google Drive. Is there any way, either in this package or in google-api-python-client, to fetch multiple files with one request? The files are relatively small, but I'd like to fetch 100 of them at once.
Is there any method for this?
I could of course call .files().get_media(fileId=...).execute() 100 times, but that's quite slow.
What I have done in one of my projects is to set up a thread pool and let each of the threads start a request. To do so, try the following snippet (which you need to adapt to your use case):
from pathos.threading import ThreadPool as Pool

N = 10  # number of threads
my_pool = Pool(N)
my_pool.amap(<function>, <args>)
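To make those placeholders concrete, here is a sketch under the assumption that download_file() is your own wrapper around .files().get_media(fileId=...).execute(), and that pathos' amap() returns an async result whose .get() collects the return values (check the pathos docs for your version):
from pathos.threading import ThreadPool as Pool

def download_file(file_id):
    # placeholder: call .files().get_media(fileId=file_id).execute() here
    # and return the downloaded bytes
    return file_id

file_ids = ["id1", "id2", "id3"]                 # the ~100 file ids you want to fetch
my_pool = Pool(10)
result = my_pool.amap(download_file, file_ids)   # non-blocking
contents = result.get()                          # wait for all downloads to finish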
I'm writing a Python script where I use the multiprocessing library to launch multiple tesseract instances in parallel.
When I call tesseract multiple times in sequence using a loop, it works. However, when I try to parallelize the code, everything looks fine but I'm not getting any results (I waited for 10 minutes).
In my code I try to OCR multiple PDF pages after I split them from the original multi-page PDF.
Here's my code:
import subprocess
from multiprocessing import Pool

def processPage(i):
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    p = subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"])
    print "tesseract did the job for the ", str(i + 1), "page"

pool1 = Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
From what I know of pytesseract, it does not handle multiple processes well: if you have a quad-core machine and run 4 processes simultaneously, tesseract will be choked and you will see very high CPU usage, among other problems. If you need this for a company and you don't want to go with the Google Vision API, you will have to set up multiple servers and do socket programming to request text from the different servers, so that the number of parallel processes stays below what each server can run at the same time; for a quad-core that should be 2 or 3.
Otherwise you can hit the Google Vision API; they have lots of servers and their output is quite good too.
Disabling multithreading inside tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment, but you still must not run too many tesseract processes on the same server.
See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
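A minimal sketch of passing that environment variable to each tesseract subprocess, adapting the processPage() function from the question:
import os
import subprocess

def processPage(i):
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    # Limit tesseract to a single OpenMP thread so the pool's worker
    # processes don't oversubscribe the CPU
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"], env=env)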
Your code is launching a Pool and exiting before it finishes its job. You need to call close and join.
pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()
Alternatively, you can wait for its results.
pool1=Pool(4)
print pool1.map(processPage, range(len(pdf.pages)))
I have a python script with a normal runtime of ~90 seconds. However, when I change only minor things in it (like the colors in my final pyplot figure) and then execute it multiple times in quick succession, its runtime increases to close to 10 minutes.
Some bullet points of what I'm doing:
I'm not downloading anything, nor creating new files with my script.
I merely open some locally saved .dat-files using numpy.genfromtxt and crunch some numbers with them.
I transform my data into a rec-array and use indexing via array.columnname extensively.
For each file I loop over a range of criteria that basically constitute different maximum and minimum values for evaluation, and embedded in that I use an inner loop over the lines of the data arrays. A few if's here and there but nothing fancy, really.
I use the multiprocessing module as follows
import multiprocessing
npro = multiprocessing.cpu_count() # Count the number of processors
pool = multiprocessing.Pool(processes=npro)
bigdata = list(pool.map(analyze, range(len(FileEndings))))
pool.close()
with analyze being my main function and FileEndings its input, a string, used to create the right name of the file I want to load and then evaluate. Afterwards, I use it a second time with
pool2 = multiprocessing.Pool(processes=npro)
listofaverages = list(pool2.map(averaging, range(8)))
pool2.close()
averaging being another function of mine.
I use numba's @jit decorator to speed up the basic calculations I do in my inner loops, with nogil, nopython, and cache all set to True. Commenting these out doesn't resolve the issue.
I run the script on Ubuntu 16.04 and am using a recent Anaconda build of Python to run it.
I write the code in PyCharm and run it in its console most of the time. However, changing to bash doesn't help either.
Simply not running the script for about 3 minutes lets it go back to its normal runtime.
Using htop reveals that all processors are at full capacity when running. I am also seeing a lot of processes stemming from PyCharm (50 or so) that are each at equal MEM% of 7.9. The CPU% is at 0 for most of them, a few exceptions are in the range of several %.
Has anyone experienced such an issue before? And if so, any suggestions what might help? Or are any of the things I use simply prone to cause these problems?
This may be closed; the problem was caused by a malfunctioning fan in my machine.
I'm starting out in Python and have done a lot of programming in VB in the past. Python seems much easier to work with and far more powerful. I'm in the process of ditching Windows altogether; I have quite a few VB programs I've written and want to use them on Linux without having to touch anything that involves Windows.
I'm trying to take one of my VB programs and convert it to Python. I think I pretty much have.
One thing I never could find a way to do in VB was to use Program 1, a calling program, to call Program 2 and run it multiple times. I had a program that would search a website looking for newly updated material; everything was numbered (1234567890_1.mp3, for example). Not every value was used, and I would have to search to find which files existed and which didn't. Typically the site would run through around 100,000 possible files a day, with only 2-3 files actually being used each day. I had the program set up to search 10,000 files, and if it found a file that existed it downloaded it and then moved to the next possible file and tested it. I would run this program 10 times simultaneously, with each copy set up to search a separate 10,000-file block. I always wanted to set it up so I could have a calling program that would have the user set the Main Block (1234) and the Secondary Block (5), with the Secondary Block possibly being a range of values. The calling program would then start up 10 separate programs (6, err 0-9 in reality) and would use Main Block and Secondary Block as the values to set up the call for each of the 10 Program 2s. When each of the 10 programs got called, all running at the same time, they would be given the appropriate search locations, so they would be searching the website to find what new files had been added throughout the previous day. It would only take 35-45 minutes to complete each day, versus multiple hours if I ran through everything in one long continuous program.
I think I could do this with Python using a .txt file that Program 1 writes and Program 2 reads. I'm just not sure if I would run into problems with changing the set value before Program 2 has read it and started using it. I think I would have to add a pause into the program to play it safe... I'm not really sure.
Is there another way I could pass a value from Program 1 to Program 2 to accomplish the task I'm looking to accomplish?
It seems simple enough to have a master class (a block, as you would call it) that would contain all of the different threads/classes in a list. Communication-wise, there could simply be a communication structure like so:
Thread <--> Master <--> Thread
or
Thread <---> Thread <---> Thread
The latter would look more like a web structure.
The first option would probably be a bit more complicated since you have a "middle man". However, it does allow for mass communication. Also, all that would need to be passed to the other classes is the Master class, or a function that provides communication.
The second option allows for direct communication between threads. So if there are two threads that need to work together a lot, and you don't want other threads possibly interfering with the commands, then the second option is best. However, the list of classes would have to be passed to the function.
Either way, you are going to need a Master thread at the center of the communication (for simplicity's sake). Otherwise you could get into socket and file stuff, but the master-thread approach is faster, more efficient, and less likely to cause headaches.
Hopefully you understood what I was saying. It is all theoretical, but I have used similar systems in my programs.
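A minimal sketch of the "middle man" option, assuming each worker reports back to the master over a shared queue.Queue; all names and block values here are illustrative, not from the original post:
import queue
import threading

def worker(name, inbox, to_master):
    # Each worker processes search blocks from its own inbox and reports back
    for block in iter(inbox.get, None):           # None is the shutdown signal
        to_master.put((name, "searched block %s" % block))

to_master = queue.Queue()
inboxes = {i: queue.Queue() for i in range(3)}
threads = [threading.Thread(target=worker, args=(i, inboxes[i], to_master))
           for i in range(3)]
for t in threads:
    t.start()

# The master hands out one block per worker, then collects the reports
for i, inbox in inboxes.items():
    inbox.put("1234%d" % i)
    inbox.put(None)                               # tell the worker to stop
for _ in range(3):
    print(to_master.get())
for t in threads:
    t.join()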
There are a bunch of ways to do this; probably the simplest is to use os.spawnv to run multiple instances of Program 2 and pass the appropriate values to each, e.g.:
import os
import sys

def main(urlfmt, start, end, threads):
    start, end, threads = int(start), int(end), int(threads)
    per_thread = (end - start + 1) // threads
    for i in range(threads):
        s = start + i * per_thread
        e = min(s + per_thread - 1, end)
        outfile = 'output{}.txt'.format(i + 1)
        # args[0] is conventionally the program name itself
        os.spawnv(os.P_NOWAIT, 'program2.exe',
                  ['program2.exe', urlfmt, str(s), str(e), outfile])

if __name__ == "__main__":
    if len(sys.argv) != 5:
        print('Usage: web_search.py urlformat start end threads')
    else:
        main(*sys.argv[1:])
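For completeness, a sketch of what the Program 2 side might look like, reading its slice from sys.argv; the exists() check (a HEAD request via urllib) and the URL format are placeholders, not code from the original answer:
import sys
import urllib.error
import urllib.request

def exists(url):
    # Hypothetical check: a HEAD request that succeeds means the file is there
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req):
            return True
    except urllib.error.HTTPError:
        return False

urlfmt, start, end, outfile = sys.argv[1], int(sys.argv[2]), int(sys.argv[3]), sys.argv[4]
with open(outfile, "w") as out:
    for n in range(start, end + 1):
        url = urlfmt.format(n)            # e.g. "http://example.com/{}_1.mp3"
        if exists(url):
            out.write(url + "\n")         # or download it here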
I am attempting to write a wrapper that iterates over running another program with different input files. The program (over which I have no control, but need to use) has to be run from the same directory as its input file(s). So far my method is to use the os module to change/create the directory structure, use apply_async to run the program given a sub-directory, and have each child in apply_async change directory and create its file; the first 8 processes run successfully (I have 8 virtual cores).
However, I am queueing up to 100 of these processes (they run a simulation which takes a few minutes; I'm looking to optimize). I use "call" on the outside executable I am running. I thought everything was going great, but then after the 8th simulation runs, everything stops; I check and 0 processes are running. It is as if the queue forgot about the other processes.
What can I do to fix this? I know my RAM only goes up about 300 MB out of 8GB.
Do I need to look into implementing some sort of queue myself that waits for the exit code of the simulation executable?
Thank you in advance.
Maybe better than nothing. This shows a correct way to use apply_async(), and demonstrates that - yup - there's no problem creating many more tasks than processes. I'd tell you how this differs from what you're doing, but I have no idea what you're doing ;-)
import multiprocessing as mp

def work(i):
    from time import sleep
    sleep(2.0 if i & 1 else 1.0)
    return i*i

if __name__ == "__main__":
    pool = mp.Pool(4)
    results = [pool.apply_async(work, (i,)) for i in range(100)]
    results = [r.get() for r in results]
    print len(results), results
    pool.close()
    pool.join()