I used to use pygdrive3 to connect to Google Drive. Is there any way, either in that package or in google-api-python-client, with which I could get more files with one request? The files are relatively small, but I'd like to fetch 100 of them at once.
Is there any method for this?
I could of course call .files().get_media(fileId=...).execute() 100 times, but that's quite slow.
What I have done in one of my projects is to set up a thread pool and let each thread start a request. To do so, try the following snippet (which you need to adapt to your use case):
from pathos.threading import ThreadPool as Pool

N = 10  # number of threads
my_pool = Pool(N)
my_pool.amap(<function>, <args>)
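Applied to the Drive case, a minimal sketch could look like the following. It assumes an already-authenticated google-api-python-client Drive v3 service object and a list file_ids; those names are illustrative, not part of any fixed API:

from pathos.threading import ThreadPool as Pool

N = 10  # number of worker threads

def download_one(file_id):
    # Each thread issues its own get_media request against the
    # authenticated Drive service (assumed to exist already).
    return service.files().get_media(fileId=file_id).execute()

my_pool = Pool(N)
# amap schedules the downloads asynchronously; .get() blocks until all are done
contents = my_pool.amap(download_one, file_ids).get()

As far as I know, the Drive API has no single call that returns the content of 100 files at once, so overlapping the requests like this is the main way to hide the per-request latency.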
I have a CSV file with 10,000 rows; each row contains a link, and I want to download some info from each link. As that's a time-consuming task, I manually split it into 4 Python scripts, each one working on 2,500 rows. After that I open 4 terminals and run each of the scripts.
However, I wonder if there's a more efficient way of doing that. Up to now I have 4 .py scripts that I launch manually. What happens if I have to do the same but with 1,000,000 rows? Should I manually create, for example, 50 scripts and have each script download the info for its own rows? I hope I managed to explain myself :)
Thanks!
You don't need to do any manual splitting – set up a multiprocessing.Pool() with the number of workers you want to be processing your data, and have a function do your work for each item. A simplified example:
import multiprocessing

# This function is run in a separate process
def do_work(line):
    return f"{line} is {len(line)} characters long. This result brought to you by {multiprocessing.current_process().name}"

def main():
    work_items = [f"{2 ** i}" for i in range(1_000)]  # You'd read these from your file
    with multiprocessing.Pool(4) as pool:
        for result in pool.imap(do_work, work_items, chunksize=20):
            print(result)

if __name__ == "__main__":
    main()
This has (up to) 4 processes working on your data; for efficiency, each worker receives chunks of 20 tasks at a time.
If you don't need the results to be in order, use the faster imap_unordered.
You can take a look at https://docs.python.org/3/library/asyncio-task.html to make the download + processing tasks async.
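A rough sketch of that (Python 3.9+ for asyncio.to_thread; fetch_info is a hypothetical stand-in for your per-link work):

import asyncio
import urllib.request

def fetch_info(link):
    # Blocking download of one link (placeholder for your real per-row work)
    with urllib.request.urlopen(link, timeout=10) as resp:
        return link, len(resp.read())

async def main(links):
    sem = asyncio.Semaphore(20)  # at most 20 downloads in flight at once

    async def bounded(link):
        async with sem:
            # Run the blocking download in a worker thread
            return await asyncio.to_thread(fetch_info, link)

    return await asyncio.gather(*(bounded(link) for link in links))

# results = asyncio.run(main(links_read_from_the_csv))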
Use threads to run multiple downloads concurrently within one interpreter (https://realpython.com/intro-to-python-threading)
I'm writing a Python script where I use the multiprocessing library to launch multiple tesseract instances in parallel.
When I make multiple calls to tesseract in sequence, using a loop, it works. However, when I try to parallelize the code, everything looks fine but I'm not getting any results (I waited for 10 minutes).
In my code I try to OCR multiple PDF pages after I split them from the original multi-page PDF.
Here's my code:
import subprocess
from multiprocessing import Pool

def processPage(i):
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"])
    print "tesseract did the job for the", str(i + 1), "page"

pool1 = Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
As far as I know of tesseract, it does not cope well with multiple processes. If you have a quad-core machine and run 4 tesseract processes simultaneously, tesseract will be choked and you will get very high CPU usage, among other problems. If you need this for a company and don't want to go with the Google Vision API, you will have to set up multiple servers and do socket programming to request text from the different servers, so that the number of parallel processes stays below what each server can handle at once; for a quad core that should be 2 or 3.
Otherwise you can hit the Google Vision API; they have a lot of servers and their output is quite good too.
Disabling multithreading inside tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment, but you must not run multiple tesseract processes on the same server.
See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
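For example, you could pass the variable explicitly to each tesseract subprocess; a sketch based on the processPage function from the question:

import os
import subprocess

def processPage(i):
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    # Limit tesseract's internal OpenMP threading so the OCR processes
    # in the pool don't fight each other for cores.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"], env=env)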
Your code is launching a Pool and exiting before it finishes its job. You need to call close and join.
pool1 = Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()
Alternatively, you can wait for its results.
pool1 = Pool(4)
print pool1.map(processPage, range(len(pdf.pages)))
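On Python 3 the pool can also be used as a context manager; map blocks until all results are ready, after which the workers are torn down automatically. A sketch reusing the names from the question:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(processPage, range(len(pdf.pages)))
    print(results)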
I have a simple Python script which checks a few URLs:
f = urllib2.urlopen(urllib2.Request(url))
As I have the socket timeout set to 5 seconds, it is sometimes bothersome to wait 5 seconds * number of URLs for the results.
Is there any easy, standardized way to run those URL checks asynchronously without big overhead? The script must use standard Python components on a vanilla Ubuntu distribution (no additional installations).
Any ideas?
I wrote something called multibench a long time ago. I used it for almost the same thing you want to do here, which was to call multiple concurrent instances of wget and see how long it takes to complete. It is a crude load testing and performance monitoring tool. You will need to adapt this somewhat, because this runs the same command n times.
Install additional software. It's a waste of time to re-invent something just because of packaging decisions made by someone else.
For starters, I'm new to Python, so my code below may not be the cleanest. For a program I need to download about 500 webpages. The URLs are stored in an array which is populated by a previous function. The downloading part goes something like this:
def downloadpages(num):
    import urllib
    for i in range(0, numPlanets):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size should only take a minute or two. Is there a way I can speed this up? Thanks
Edit: To anyone who is interested, following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download speed continues to decrease as the threads are increased up until around 60 threads, after which the download time begins to rise again.
But you're not downloading a single file here. You're downloading 500 separate pages, and each connection involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 x 20kb is not the same as downloading a single file of that size.
You can speed up execution significantly by using threads (be careful, though, not to overload the server); a sketch follows the links below.
Intro material/Code samples:
http://docs.python.org/library/threading.html
Python Package For Multi-Threaded Spider w/ Proxy Support?
http://code.google.com/p/workerpool/wiki/MassDownloader
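A minimal sketch of the threaded approach using only the standard library (Python 3 and concurrent.futures; downloadlist and names are the lists from the question):

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download_one(url, name):
    # Each worker thread fetches one page and writes it to disk
    urllib.request.urlretrieve(url, 'webpages/' + name + '.htm')
    return name

# ~50 threads matched the sweet spot mentioned in the edit above
with ThreadPoolExecutor(max_workers=50) as executor:
    for name in executor.map(download_one, downloadlist, names):
        print(name, 'done')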
You can use greenlets to do so.
E.g. with the eventlet lib:
import eventlet
from eventlet.green import urllib2

urls = [url1, url2, ...]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
All calls in the pool will be pseudo-simultaneous.
Of course you must install eventlet with pip or easy_install beforehand.
You have several implementations of greenlets in Python. You could do the same with gevent or another one.
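For instance, a roughly equivalent gevent sketch (kept in the same Python 2 / urllib2 style as the eventlet example above):

import gevent
from gevent import monkey
monkey.patch_all()  # make the standard socket/urllib2 modules cooperative

import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
bodies = [job.value for job in jobs]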
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP 1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over that, instead of having a TCP/IP setup/teardown for each request. Not sure if urllib2 does that by default; you might have to roll your own.
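A sketch of connection reuse with the standard library's http.client (httplib in Python 2), assuming all the pages live on one host; the host and paths here are made up:

import http.client

HOST = 'www.example.com'          # hypothetical host serving all the pages
paths = ['/page1.htm', '/page2.htm', '/page3.htm']

conn = http.client.HTTPConnection(HOST)
bodies = []
for path in paths:
    # Reuse the same TCP connection for every request (HTTP/1.1 keep-alive)
    conn.request('GET', path)
    resp = conn.getresponse()
    bodies.append(resp.read())    # read the body fully before the next request
conn.close()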
I'm new to web apps, so I'm not used to worrying about CPU limits, but it looks like I am going to have a problem with this code. I read in Google's quotas page that I can use 6.5 CPU-hours per day and 15 CPU-minutes per minute.
Google said:
CPU time is reported in "seconds," which is equivalent to the number of CPU cycles that can be performed by a 1.2 GHz Intel x86 processor in that amount of time. The actual number of CPU cycles spent varies greatly depending on conditions internal to App Engine, so this number is adjusted for reporting purposes using this processor as a reference measurement.
And
             Per Day          Max Rate
CPU Time     6.5 CPU-hours    15 CPU-minutes/minute
What I want to know:
Is this script going over the limit?
(If yes) How can I make it not go over the limit?
I use the urllib library; should I use Google's URL Fetch API? Why?
Absolutely any other helpful comment.
What it does:
It scrapes (crawls) Project Free TV. I will only run it completely once, then replace it with a shorter, faster script.
from urllib import urlopen
import re

alphaUrl = 'http://www.free-tv-video-online.me/movies/'
alphaPage = urlopen(alphaUrl).read()
patFinderAlpha = re.compile('<td width="97%" nowrap="true" class="mnlcategorylist"><a href="(.*)">')
findPatAlpha = re.findall(patFinderAlpha, alphaPage)
listIteratorAlpha = []
listIteratorAlpha[:] = range(len(findPatAlpha))

for ai in listIteratorAlpha:
    betaUrl = 'http://www.free-tv-video-online.me/movies/' + findPatAlpha[ai] + '/'
    betaPage = urlopen(betaUrl).read()
    patFinderBeta = re.compile('<td width="97%" class="mnlcategorylist"><a href="(.*)">')
    findPatBeta = re.findall(patFinderBeta, betaPage)
    listIteratorBeta = []
    listIteratorBeta[:] = range(len(findPatBeta))

    for bi in listIteratorBeta:
        gammaUrl = betaUrl + findPatBeta[bi]
        gammaPage = urlopen(gammaUrl).read()
        patFinderGamma = re.compile('<a href="(.*)" target="_blank" class="mnllinklist">')
        findPatGamma = re.findall(patFinderGamma, gammaPage)
        patFinderGamma2 = re.compile('<meta name="keywords"content="(.*)">')
        findPatGamma2 = re.findall(patFinderGamma2, gammaPage)
        listIteratorGamma = []
        listIteratorGamma[:] = range(len(findPatGamma))

        for gi in listIteratorGamma:
            deltaUrl = findPatGamma[gi]
            deltaPage = urlopen(deltaUrl).read()
            patFinderDelta = re.compile("<iframe id='hmovie' .* src='(.*)' .*></iframe>")
            findPatDelta = re.findall(patFinderDelta, deltaPage)
            PutData(findPatGamma2[gi], findPatAlpha[ai], findPatDelta)
If I forgot anything please let me know.
Update:
This is roughly how many times each level will run, in case it's helpful for answering the question.
          per cycle     total
Alpha:            1         1
Beta:            16        16
Gamma:         ~250     ~4000
Delta:           ~6    ~24000
I don't like to optimize until I need to. First, just try it. It might just work. If you go over quota, shrug, come back tomorrow.
To split jobs into smaller parts, look at the Task Queue API. Maybe you can divide the workload into two queues, one that scrapes pages and one that processes them. You can put limits on the queues to control how aggressively they are run.
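A sketch of enqueueing the per-page work with the legacy App Engine SDK; the /scrape-worker URL is a hypothetical handler you would write to fetch and process a single page:

from google.appengine.api import taskqueue

def enqueue_pages(page_urls):
    # One task per page; App Engine retries failed tasks and
    # rate-limits the queue according to its configuration.
    for page_url in page_urls:
        taskqueue.add(url='/scrape-worker', params={'page_url': page_url})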
P.S. On Regex for HTML: Do what works. The academics will call you out on semantic correctness, but if it works for you, don't let that stop you.
I use the urllib library, should I use Google's URL Fetch API? Why?
urllib on App Engine production servers is implemented on top of the URLFetch API.
It's unlikely that this will go over the free limit, but it's impossible to say without seeing how big the list of URLs it needs to fetch is, and how big the resulting pages are. The only way to know for sure is to run it - and there's really no harm in doing that.
You're more likely to run into the limitations on individual request execution - 30 seconds for frontend requests, 10 minutes for backend requests like cron jobs - than run out of quota. To alleviate those issues, use the Task Queue API to split your job into many parts. As an additional benefit, they can run in parallel! You might also want to look into Asynchronous URLFetch - though it's probably not worth it if this is just a one-off script.
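For reference, asynchronous URLFetch looks roughly like this (legacy App Engine SDK, sketch only):

from google.appengine.api import urlfetch

def fetch_all(urls):
    # Start all fetches without blocking, then collect the results
    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=10)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append(rpc)
    return [rpc.get_result().content for rpc in rpcs]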