I'm trying to make an Instagram scraper with a Python library (Instaloader). It works, but it's very slow. I'm trying to speed it up with multithreading, but then another problem occurred.
This is the code without multithreading; it works fine:
import threading
import instaloader
import time

L = instaloader.Instaloader()

def func1(name):
    first = []
    posts = instaloader.Profile.from_username(L.context, name).get_posts()
    posts = list(posts)
    for p in posts[0:5]:
        first.append(p)
    return first

def func2(name):
    second = []
    posts = instaloader.Profile.from_username(L.context, name).get_posts()
    posts = list(posts)
    for p in posts[5:10]:
        second.append(p)
    return second

t = time.time()
print(func1('eminem'))
print(func2('eminem'))
print(time.time()-t) # this is 47.43 seconds
But when I use multithreading, the execution time is much shorter, yet I don't get the results: the 'return' statements have no effect. I need the return values because this is only part of a larger program, so I can't just print them.
This is the code with threads:
L = instaloader.Instaloader()

def func1(name):
    first = []
    posts = instaloader.Profile.from_username(L.context, name).get_posts()
    posts = list(posts)
    for p in posts[0:5]:
        first.append(p)
    return first

def func2(name):
    second = []
    posts = instaloader.Profile.from_username(L.context, name).get_posts()
    posts = list(posts)
    for p in posts[5:10]:
        second.append(p)
    return second

t = time.time()
t1 = threading.Thread(target = func1, args=('eminem',))
t2 = threading.Thread(target = func2, args=('eminem',))
t1.start()
t2.start()
t1.join()
t2.join()
print(time.time()-t) # this is 25.36 seconds
What am I doing wrong?
The easiest way in your case is to pass a shared data structure with distinct keys to accumulate the results from the different functions.
Instead of using the local lists first = [] and second = [], append the results to a shared structure like this:
def func1(name, results):
    ...
    for p in posts[0:5]:
        results['func1'].append(p)
Do the same for the func2 function.
results = {'func1': [], 'func2': []}
t1 = threading.Thread(target = func1, args=('eminem', results))
t2 = threading.Thread(target = func2, args=('eminem', results))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)
Another option is to use the concurrent.futures.Executor.submit approach.
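For example, a minimal sketch of that approach, reusing func1 and func2 from the question (an illustration only, not tested against Instagram's rate limits):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules the call and returns a Future immediately
    future1 = executor.submit(func1, 'eminem')
    future2 = executor.submit(func2, 'eminem')
    # result() blocks until the corresponding function has returned
    first = future1.result()
    second = future2.result()

print(first)
print(second)

This keeps the return statements in func1 and func2 untouched, because the Future objects carry the return values back to the main thread.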
I've just released a module that could help you with your project and its scalability. Take a look at Akuanduba README file and see if it works for you.
Related
downloadStart = datetime.now()
while (True):
    requestURL = transactionAPI.format(page = tempPage, limit = 5000)
    response = requests.get(requestURL, headers=headers)
    json_data = json.loads(response.content)
    tempMomosTransactionHistory.extend(json_data["list"])
    if(datetime.fromtimestamp(json_data["list"][-1]["crtime"]) < datetime(datetime.today().year, datetime.today().month, datetime.today().day - dateRange)):
        break
    tempPage += 1
downloadEnd = datetime.now()
Any suggestions, please? Threading or something like that?
The outputs are:
downloadtime 0:00:02.056010
downloadtime 0:00:05.680806
downloadtime 0:00:05.447945
You need to improve it in two ways.
Optimise code within loop
Parallelize code execution
#1
Looking at your code, I can see one improvement: create the datetime.today() object once instead of calling it three times. Check whether other parts, such as the transactionAPI request, can be optimised further.
#2
If you have a multi-core CPU machine, you can take advantage of it by spawning a thread per page. Refer to the modified code below.
import threading

json_data = None  # shared with the main loop, which checks the latest batch

def processRequest(tempPage):
    global json_data
    requestURL = transactionAPI.format(page = tempPage, limit = 5000)
    response = requests.get(requestURL, headers=headers)
    json_data = json.loads(response.content)
    tempMomosTransactionHistory.extend(json_data["list"])

downloadStart = datetime.now()
while (True):
    # create a thread per page
    t1 = threading.Thread(target=processRequest, args=(tempPage, ))
    t1.start()
    t1.join()  # wait for the page so json_data is set before the check below
    # Fetch the datetime.today() object once instead of three times
    datetimetoday = datetime.today()
    if(datetime.fromtimestamp(json_data["list"][-1]["crtime"]) < datetime(datetimetoday.year, datetimetoday.month, datetimetoday.day - dateRange)):
        break
    tempPage += 1
downloadEnd = datetime.now()
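As written above, waiting for each page before checking the stop condition keeps the downloads sequential. If you want the page downloads themselves to overlap, one possible sketch follows. It is only a sketch: it reuses transactionAPI, headers, tempPage, tempMomosTransactionHistory and dateRange from the question, assumes pages can be fetched in fixed-size batches, and replaces the day arithmetic with a timedelta so the cutoff does not break at month boundaries.

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
import json
import requests

def fetch_page(page):
    # Download one page and return the decoded JSON payload
    requestURL = transactionAPI.format(page=page, limit=5000)
    response = requests.get(requestURL, headers=headers)
    return json.loads(response.content)

downloadStart = datetime.now()
cutoff = datetime.today() - timedelta(days=dateRange)
batchSize = 5
page = tempPage
done = False
with ThreadPoolExecutor(max_workers=batchSize) as executor:
    while not done:
        # Fetch a whole batch of pages concurrently, in page order
        batch = list(executor.map(fetch_page, range(page, page + batchSize)))
        for json_data in batch:
            tempMomosTransactionHistory.extend(json_data["list"])
        # Stop once the oldest transaction in the batch is older than the cutoff
        done = datetime.fromtimestamp(batch[-1]["list"][-1]["crtime"]) < cutoff
        page += batchSize
downloadEnd = datetime.now()

The batch size is a guess; the right value depends on how many concurrent requests the API tolerates.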
I'm trying to make a bot for IQ Option.
I already did it, but one by one: I had to open 10 bots to check 10 pairs.
I've been trying all day with ThreadPool, threading, map and starmap (I don't think I used them as well as they can be used).
The thing is: I'm checking the values of pairs (EURUSD, EURAUD...) over the last 100 minutes. When I do it one by one, each call takes between 80 and 300 ms to return. I'm now trying to make all the calls at the same time and get their results around the same time into their respective variables.
At the moment my code is like this:
from iqoptionapi.stable_api import IQ_Option
from functools import partial
from multiprocessing.pool import ThreadPool as Pool
from time import *
from datetime import datetime, timedelta
import os
import sys
import dados  # my login data
import config  # atm is just payoutMinimo = 0.79

parAtivo = {}

class PAR:
    def __init__(self, par, velas):
        self.par = par
        self.velas = velas
        self.lucro = 0
        self.stoploss = 50000
        self.stopgain = 50000

def verificaAbertasPayoutMinimo(API, payoutMinimo):
    status = API.get_all_open_time()
    profits = API.get_all_profit()
    abertasPayoutMinimo = []
    for x in status['turbo']:
        if status['turbo'][x]['open'] and profits[x]['turbo'] >= payoutMinimo:
            abertasPayoutMinimo.append(x)
    return abertasPayoutMinimo

def getVelas(API, par, tempoAN, segundos, numeroVelas):
    return API.get_candles(par, tempoAN*segundos, numeroVelas, time()+50)

def logVelas(velas, par):
    global parAtivo
    parAtivo[par] = PAR(par, velas)

def verificaVelas(API, abertasPayoutMinimo, tempoAN, segundos, numeroVelas):
    pool = Pool()
    global parAtivo
    for par in abertasPayoutMinimo:
        print(f"Verificando par {par}")
        pool = Pool()
        if par not in parAtivo:
            callbackFunction = partial(logVelas, par=par)
            pool.apply_async(
                getVelas,
                args=(API, par, tempoAN, segundos, numeroVelas),
                callback=callbackFunction
            )
    pool.close()
    pool.join()

def main():
    tempoAN = 1
    segundos = 60
    numeroVelas = 20
    tempoUltimaVerificacao = datetime.now() - timedelta(days=99)
    global parAtivo
    conectado = False
    while not conectado:
        API = IQ_Option(dados.user, dados.pwd)
        API.connect()
        if API.check_connect():
            os.system("cls")
            print("Conectado com sucesso.")
            sleep(1)
            conectado = True
        else:
            print("Erro ao conectar.")
            sleep(1)
            conectado = False
    API.change_balance("PRACTICE")
    while True:
        if API.get_balance() < 2000:
            API.reset_practice_balance()
        if datetime.now() > tempoUltimaVerificacao + timedelta(minutes=5):
            abertasPayoutMinimo = verificaAbertasPayoutMinimo(API, config.payoutMinimo)
            tempoUltimaVerificacao = datetime.now()
        verificaVelas(API, abertasPayoutMinimo, tempoAN, segundos, numeroVelas)
        for item in parAtivo:
            print(parAtivo[item])
        break  # execute only 1 time for testing

if __name__ == "__main__":
    main()
Edit 1: I've just added more info; this is actually the whole code right now.
Edit 2: When I print it like this:
for item in parAtivo:
    print(parAtivo[item].velas[-1]['close'])
I get:
0.26671
0.473878
0.923592
46.5628
1.186974
1.365679
0.86263
The values are correct; the problem is that it takes too long, almost 3 seconds, the same as doing it without ThreadPool.
Solved.
Did it using threading.Thread, like this:
for par in abertasPayoutMinimo:
    t = threading.Thread(
        target=getVelas,
        args=(API, par, tempoAN, segundos)
    )
    t.start()
    t.join()
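Note that calling t.join() inside the loop waits for each thread before starting the next one, so the downloads still run one after another. If the goal is to have all pairs in flight at once, a hedged sketch with concurrent.futures could look like the following (the function name verificaVelasConcorrente is made up here; getVelas, PAR and parAtivo are the ones from the question):

from concurrent.futures import ThreadPoolExecutor, as_completed

def verificaVelasConcorrente(API, abertasPayoutMinimo, tempoAN, segundos, numeroVelas):
    global parAtivo
    pares = [par for par in abertasPayoutMinimo if par not in parAtivo]
    with ThreadPoolExecutor(max_workers=max(len(pares), 1)) as executor:
        # One future per pair, so every get_candles call is in flight at the same time
        futures = {
            executor.submit(getVelas, API, par, tempoAN, segundos, numeroVelas): par
            for par in pares
        }
        for future in as_completed(futures):
            par = futures[future]
            parAtivo[par] = PAR(par, future.result())

as_completed() yields each future as soon as its request finishes, so the slowest pair no longer holds up the others.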
I use a script to parse some sites and get news from them.
Each function in this script parses one site and returns a list of articles, and then I want to combine them all into one big list.
Parsing site by site takes too long, so I decided to use multithreading.
I found a sample like the one at the bottom, but it doesn't seem Pythonic to me.
If I add one more function to parse one more site, I will need to add the same block of code each time:
qN = Queue()
Thread(target=wrapper, args=(last_news_from_bar, qN)).start()
news_from_N = qN.get()
for new in news_from_N:
    all_news.append(new)
Is there another solution to do this kind of stuff?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from queue import Queue
from threading import Thread

def wrapper(func, queue):
    queue.put(func())

def last_news_from_bar():
    ...
    return list_of_articles  # [['title1', 'http://someurl1', '2017-09-13'], ['title2', 'http://someurl2', '2017-09-13']]

def last_news_from_foo():
    ...
    return list_of_articles

q1, q2 = Queue(), Queue()
Thread(target=wrapper, args=(last_news_from_bar, q1)).start()
Thread(target=wrapper, args=(last_news_from_foo, q2)).start()

news_from_bar = q1.get()
news_from_foo = q2.get()

all_news = []
for new in news_from_bar:
    all_news.append(new)
for new in news_from_foo:
    all_news.append(new)

print(all_news)
Solution without Queue:
from threading import Thread, Lock

NEWS = []
LOCK = Lock()

def gather_news(url):
    while True:
        news = news_from(url)
        if not news:
            break
        with LOCK:
            NEWS.append(news)

if __name__ == '__main__':
    T = []
    for url in ['url1', 'url2', 'url3']:
        t = Thread(target=gather_news, args=(url,))
        t.start()
        T.append(t)
    # Wait until all threads are done
    for t in T:
        t.join()
    print(NEWS)
All you need to do is use a single queue and extend your result array:
q1 = Queue()
Thread(target=wrapper, args=(last_news_from_bar, q1)).start()
Thread(target=wrapper, args=(last_news_from_foo, q1)).start()
all_news = []
all_news.extend(q1.get())
all_news.extend(q1.get())
print(all_news)
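A variant of the same idea with concurrent.futures, so that adding one more site only means adding its function to a list (a sketch assuming the last_news_from_* functions from the question):

from concurrent.futures import ThreadPoolExecutor

# Add a new site by appending its parser function here
scrapers = [last_news_from_bar, last_news_from_foo]

all_news = []
with ThreadPoolExecutor(max_workers=len(scrapers)) as executor:
    # map() runs every scraper in its own worker thread and yields the results in order
    for articles in executor.map(lambda scrape: scrape(), scrapers):
        all_news.extend(articles)

print(all_news)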
I am trying to execute different methods with a Pool object from the Python multiprocessing library. I've tried many ways, but all of them get stuck when I call .get() or .join(). I've googled a lot, and none of the topics or tutorials worked for me. My code is below:
def get_profile(artist_id):
    buckets = ['years_active', 'genre', 'images']
    artist = Artist(artist_id)
    return artist.get_profile(buckets=buckets)

def get_songs(artist_id):
    from echonest.playlist import Playlist
    return Playlist().static(artist_ids=[artist_id])

def get_similar(artist_id):
    artist = Artist(artist_id)
    return artist.get_similar(min_familiarity=0.5, buckets=['images'])

def get_news(artist_id):
    artist = Artist(artist_id)
    print "Executing get_news"
    return artist.get_news(high_relevance='true')

def search_artist(request, artist_name, artist_id):
    from multiprocessing import Pool, Process
    requests = [
        dict(func=get_profile, args=(artist_id,)),
        dict(func=get_songs, args=(artist_id,)),
        dict(func=get_similar, args=(artist_id,)),
        dict(func=get_news, args=(artist_id,))
    ]
    pool = Pool(processes=2)
    for req in requests:
        result = pool.apply_async(req['func'], req['args'])
    pool.close()
    pool.join()
    print "HERE IT STOPS AND NOTHING HAPPENS."
    output = [p.get() for p in results]
I hope someone can help, because I've been stuck on this for too long. Thank you in advance.
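For reference, the usual apply_async pattern keeps every AsyncResult in a list and calls get() on each of them; get() also re-raises any exception raised inside a worker, which otherwise goes unnoticed with apply_async. A minimal sketch with the four functions above (the name run_all is made up here):

from multiprocessing import Pool

def run_all(artist_id):
    jobs = [get_profile, get_songs, get_similar, get_news]
    pool = Pool(processes=2)
    # Keep every AsyncResult so the return values can be fetched afterwards
    results = [pool.apply_async(func, (artist_id,)) for func in jobs]
    pool.close()
    pool.join()
    # get() blocks until the worker is done and re-raises any worker exception
    return [r.get() for r in results]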
I have a piece of code that queries a DB and returns a set of IDs. For each ID, I need to run a related query to get a dataset. I would like to run the queries in parallel to speed up the processing. Once all the processes are run, then I build a block of text and write that to a file, then move to the next id.
How do I ensure that all the processes start at the same time, then wait for all of them to complete before moving to the page =... and writefile operations?
If run as is, I get the following error: 'Process' object is not iterable (on line 9).
Here is what I have so far:
from helpers import *
import multiprocessing

idSet = getIDset(10)

for id in idSet:
    ds1 = multiprocessing.Process(target = getDS1(id))
    ds1list1, ds1Item1, ds1Item2 = (ds1)
    ds2 = multiprocessing.Process(target = getDS2(id))
    ds3 = multiprocessing.Process(target = getDS3(id))
    ds4 = multiprocessing.Process(target = getDS4(id))
    ds5 = multiprocessing.Process(target = getDS5(id))

    movefiles = multiprocessing.Process(moveFiles(srcPath = r'Z://', src = ds1Item2 , dstPath=r'E:/new_data_dump//'))

    ## is there a better way to get them to start in unison than this?
    ds1.start()
    ds2.start()
    ds3.start()
    ds4.start()
    ds5.start()

    ## how do I know all processes are finished before moving on?
    page = +ds1+'\n' \
        +ds2+'\n' \
        +ds3+'\n' \
        +ds4+'\n' \
        +ds5+'\n'

    writeFile(r'E:/new_data_dump/', filename+'.txt', page)
I usually keep my "processes" in a list.
plist = []
for i in range(0, 5):
    p = multiprocessing.Process(target=getDS2, args=(id,))
    plist.append(p)

for p in plist:
    p.start()

... do stuff ...

for p in plist:
    p.join()  # <---- this will wait for each process to finish before continuing
Also, I think you have an issue with how you create your Process: "target" is supposed to be a function, not the result of a function call, as you seem to have it (unless your function returns functions).
It should look like this:
p = Process(target=f, args=('bob',))
where target is the function and args is a tuple of arguments, passed like so:
def f(name):
    print name
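Since multiprocessing.Process cannot hand a return value back to the parent directly, one way to get the five datasets back for building page is a Pool. This is only a sketch: collect_datasets is a made-up name, getDS1..getDS5 are the functions from the question, and how page is assembled from them is an assumption.

from multiprocessing import Pool

def collect_datasets(id):
    tasks = [getDS1, getDS2, getDS3, getDS4, getDS5]
    pool = Pool(processes=len(tasks))
    # apply_async starts every query at (roughly) the same time
    async_results = [pool.apply_async(task, (id,)) for task in tasks]
    pool.close()
    pool.join()  # all workers are finished once join() returns
    ds1, ds2, ds3, ds4, ds5 = [r.get() for r in async_results]
    page = '\n'.join(str(ds) for ds in (ds1, ds2, ds3, ds4, ds5)) + '\n'
    return page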