Speed up my API calls using Python threads

So this is what I currently have. This code makes about 5,000 calls to the NBA API and returns the total Games Played and Points Scored of every NBA player who has ever played in the playoffs. The players (names as keys, stats as values) are all added to the 'stats_dict' dictionary.
MY QUESTION IS THIS: does anybody know how I could significantly increase the speed of this process by using threading? Right now, it takes about 30 minutes to make all these API calls, which of course I would love to significantly improve upon. I've never used threads before and would appreciate any guidance.
Thanks
import pandas as pd
from nba_api.stats.endpoints import commonallplayers
from nba_api.stats.endpoints import playercareerstats
import numpy as np

player_data = commonallplayers.CommonAllPlayers(timeout=30)
player_df = player_data.common_all_players.get_data_frame().set_index('PERSON_ID')
id_list = player_df.index.tolist()

def playoff_stats(person_id):
    player_stats = playercareerstats.PlayerCareerStats(person_id, timeout=30)
    yield player_stats.career_totals_post_season.get_data_frame()[['GP', 'PTS']].values.tolist()

stats_dict = {}

def run_it():
    for i in id_list:
        try:
            stats_call = next(playoff_stats(i))
            if len(stats_call) > 0:
                stats_dict[player_df.loc[i]['DISPLAY_FIRST_LAST']] = [stats_call[0][0], stats_call[0][1]]
        except KeyError:
            continue

You're asking the wrong question. The real question is: why is my program taking 30 minutes?
In other words, where is my program spending time? What is it doing that's taking so long?
You can speed up a program by using threads ONLY if these two things are true:
The program is spending a significant fraction of its time waiting on some external resource (the internet or a printer, for example)
There is something useful that it could do in another thread while it's waiting
It is far from clear whether both of those things are true in your case.
Check out the time module in the standard Python library. If you go through your code and insert print(time.time()) statements at critical points, you will quickly see where the program is spending its time. Until you figure that out, you might be totally wasting your effort by writing a threaded version.
By the way, there are more sophisticated ways to get a handle on a program's performance, but your program is so incredibly slow that a few simple print statements should point you toward a better understanding.
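As a minimal sketch of that approach, a pair of timestamps around a suspect section tells you how long the program spends in it (slow_step here is just a stand-in for one of the real API calls):

```python
import time

def slow_step():
    # stand-in for one of the real API calls
    time.sleep(0.01)

start = time.time()
slow_step()
elapsed = time.time() - start
print(f"slow_step took {elapsed:.3f} seconds")
```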

Firstly, as others have mentioned, your program is not particularly optimized, and optimizing it should be your first step. I would recommend debugging it with some print statements or by measuring run time (see How to measure time taken between lines of code in python?).
Another possible solution that is a little more brute force is concurrent.futures. This can help to run a lot of things at once, but once again it won't matter if your code isn't optimized as you'll just be running unoptimized code a lot.
This link is for web scraping, but it might be helpful.
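A sketch of what concurrent.futures could look like for a batch of slow calls. Note that fetch_stats here is a hypothetical stand-in for a real network request, not the actual nba_api call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch_stats(player_id):
    # placeholder for a real network call such as
    # playercareerstats.PlayerCareerStats(player_id)
    time.sleep(0.05)
    return player_id, [10, 200]

id_list = range(20)
stats_dict = {}

# 10 workers run the waits concurrently, so 20 calls of 0.05s
# each finish in roughly 0.1s instead of 1s
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch_stats, pid): pid for pid in id_list}
    for future in as_completed(futures):
        pid, stats = future.result()
        stats_dict[pid] = stats

print(len(stats_dict))  # 20
```

Because the original loop spends nearly all its time waiting on HTTP responses, this kind of pool is where the speedup would come from, but only after confirming with measurements that the wait really is network-bound.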

Related

How can I measure code coverage in a production system?

I would like to measure the coverage of my Python code which gets executed in the production system.
I want an answer to this question:
Which lines get executed often (hot spots) and which lines are never used (dead code)?
Of course this must not slow down my production site.
I am not talking about measuring the coverage of tests.
I assume you are not talking about test suite code coverage which the other answer is referring to. That is a job for CI indeed.
If you want to know which code paths are hit often in your production system, then you're going to have to do some instrumentation / profiling. This will have a cost. You cannot add measurements for free. You can do it cheaply though and typically you would only run it for short amounts of time, long enough until you have your data.
Python has cProfile to do full profiling, measuring call counts per function etc. This will give you the most accurate data but will likely have relatively high impact on performance.
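A minimal cProfile run looks like this (work is just a made-up CPU-bound function to profile):

```python
import cProfile
import io
import pstats

def work():
    # arbitrary CPU-bound workload to profile
    total = 0
    for i in range(100000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# dump the most expensive functions, sorted by cumulative time
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out).sort_stats("cumulative")
stats.print_stats(5)
report = out.getvalue()
print(report)
```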
Alternatively, you can do statistical profiling which basically means you sample the stack on a timer instead of instrumenting everything. This can be much cheaper, even with high sampling rate! The downside of course is a loss of precision.
Even though it is surprisingly easy to do in Python, this stuff is still a bit much to put into an answer here. There is an excellent blog post by the Nylas team on this exact topic though.
The sampler below was lifted from the Nylas blog with some tweaks. After you start it, it fires an interrupt every millisecond and records the current call stack:
import collections
import signal

class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def _sample(self, signum, frame):
        # walk the current call stack and record it as a single string
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(
                frame.f_code.co_name,
                frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back

        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        # re-arm the timer for the next sample
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
You inspect stack_counts to see what your program has been up to. This data can be plotted in a flame-graph which makes it really obvious to see in which code paths your program is spending the most time.
If I understand it right, you want to learn which parts of your application are used most often by users.
TL;DR:
Use one of the metrics frameworks for Python if you do not want to do it by hand. Some of them are listed below:
DataDog
Prometheus
Prometheus Python Client
Splunk
It is usually done at the function level, and the approach depends on the application:
If it is a desktop app with internet access:
You can create a simple db and collect how many times your functions are called. To accomplish this, you can write a simple function and call it inside every function that you want to track. After that, you can define an asynchronous task to upload your data to the internet.
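A tiny sketch of that counting function, here as a decorator that increments an in-memory counter (the db write and the async upload are left out; checkout is a made-up example function):

```python
import collections
import functools

# in a real app these counts would be flushed to a db periodically
call_counts = collections.Counter()

def track(func):
    # hypothetical tracking decorator: bumps a counter on every call
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__qualname__] += 1
        return func(*args, **kwargs)
    return wrapper

@track
def checkout():
    pass

for _ in range(3):
    checkout()

print(call_counts)  # Counter({'checkout': 3})
```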
If it is a web application:
You can track which functions are called from JS (mostly preferred for user-behaviour tracking) or from the web API. It is good practice to start from the outside and work inward. First detect which endpoints are frequently called (if you are using a proxy like nginx, you can analyze the server logs to gather this information; it is the easiest and cleanest way). After that, insert a logger into every other function you want to track and simply analyze your logs every week or month.
But if you want to analyze your production code line by line (which is a very bad idea), you can start your application under a Python profiler. Python already has one: cProfile.
Alternatively, make a text file and, in every method of your program, append some text referencing it, like "Method one executed". Run the web application about 10 times thoroughly, as a viewer would, then write a Python program that reads the file, counts specific parts of it (or a pattern), and outputs the counts.

Is it possible to force a 2 second looping callback in Python?

I'm trying to get a looping call to run every 2 seconds. Sometimes I get the desired functionality, but other times I have to wait up to ~30 seconds, which is unacceptable for my application's purposes.
I reviewed this SO post and found that looping call might not be reliable for this by default. Is there a way to fix this?
My usage/reason for needing a consistent ~2 seconds:
The function I am calling scans an image (using CV2) for a dollar value and if it finds that amount it sends a websocket message to my point of sale client. I can't have customers waiting 30 seconds for the POS terminal to ask them to pay.
My source code is very long and not well commented as of yet, so here is a short example of what I'm doing:
from twisted.internet import reactor
from twisted.internet.task import LoopingCall

# scan the image for sales every 2 seconds
def scanForSale():
    print("Now Scanning for sale requests")

# retrieve a new image every 2 seconds
def getImagePreview():
    print("Loading Image From Capture Card")

lc = LoopingCall(scanForSale)
lc.start(2)

lc2 = LoopingCall(getImagePreview)
lc2.start(2)

reactor.run()
I'm using a Raspberry Pi 3 for this application, which is why I suspect it hangs for so long. Can I utilize multithreading to fix this issue?
Raspberry Pi is not a real time computing platform. Python is not a real time computing language. Twisted is not a real time computing library.
Any one of these by itself is enough to eliminate the possibility of a guarantee that you can run anything once every two seconds. You can probably get close but just how close depends on many things.
The program you included in your question doesn't actually do much. If this program can't reliably print each of the two messages once every two seconds then presumably you've overloaded your Raspberry Pi - a Linux-based system with multitasking capabilities. You need to scale back your usage of its resources until there are enough available to satisfy the needs of this (or whatever) program.
It's not clear whether multithreading will help - however, I doubt it. It's not clear because you've only included an over-simplified version of your program. I would have to make a lot of wild guesses about what your real program does in order to think about making any suggestions of how to improve it.

Python - Multithreads for calling the same function to run in parallel and independently

I'm new to Python and I'm struggling to solve a problem. I have three programs running: one sends data, one receives it, and the third sits transparently in the middle. The difficulty is with this third one, which I'm calling delay_loss.py.
It has to emulate delay of packets before delivering them to the receiving program.
After a lot of searching I found a possible solution (multithreading), though I'm not sure it's the best one. Since delay_loss.py can receive a lot of packets "at once" and has to pick a random time for each one to emulate network delay, I need to send each packet to the receiving program after its own randomly chosen delay, independently of the others.
I'm trying to use multithreading for this, and I think I'm using it incorrectly because all the packets are sent at the same time after some delay. The threads do not seem to run send_up() independently.
Part of the code of delay_loss.py is shown below:
import threading
import time
from multiprocessing.dummy import Pool as ThreadPool

...

pool = ThreadPool(window_size)

def send_up(pkt, time_delay, id_pkt):
    time.sleep(time_delay)
    sock_server.sendto(pkt, (C_IP, C_PORT))

def delay_pkt(pkt_recv_raw, rtt, average_delay, id_pkt):
    x = random.expovariate(1/average_delay)
    time_delay = rtt/(2+x)
    pool.apply_async(send_up, [pkt_recv_raw, time_delay, id_pkt])

...

delay_pkt(pkt_recv_raw, rtt, average_delay, id_pkt_recv)
id_pkt_recv += 1
If anyone has an idea of what I'm doing wrong, or even just wants to say "don't take this multithreading approach for this task", it would be a great help!
Thanks in advance :)
I have found a solution to my problem: I was using the pool unnecessarily.
It is much simpler to just use threading.Timer, as shown below:
t = threading.Timer(time_delay, send_up, [pkt_recv_raw, id_pkt])
t.start()
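As a slightly fuller, runnable sketch of this Timer pattern: send_up here just records each packet instead of calling sock_server.sendto, so you can see that each Timer fires after its own delay, independently of submission order:

```python
import threading
import time

results = []
lock = threading.Lock()

def send_up(pkt, id_pkt):
    # placeholder for sock_server.sendto(pkt, (C_IP, C_PORT))
    with lock:
        results.append((id_pkt, pkt))

# schedule three packets with different delays; each Timer runs
# independently, so they fire in delay order, not submission order
timers = []
for id_pkt, (pkt, delay) in enumerate([(b"a", 0.3), (b"b", 0.1), (b"c", 0.2)]):
    t = threading.Timer(delay, send_up, [pkt, id_pkt])
    t.start()
    timers.append(t)

for t in timers:
    t.join()

print([pkt for _, pkt in results])  # [b'b', b'c', b'a']
```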

How to advance the clock and go through all the events

Reading this answer (point 2) to a question related to Twisted's task.Clock for testing purposes, I found it very weird that there is no way to advance the clock from t0 to t1 while catching all the callLater calls within t0 and t1.
Of course, you could solve this problem by doing something like:
clock = task.Clock()
reactor.callLater = clock.callLater
...

def advance_clock(total_elapsed, step=0.01):
    elapsed = 0
    while elapsed < total_elapsed:
        clock.advance(step)
        elapsed += step

...

time_to_advance = 10  # seconds
advance_clock(time_to_advance)
But then we have shifted the problem toward choosing a sufficiently small step, which could be very tricky for callLater calls that sample the time from a probability distribution, for instance.
Can anybody think of a solution to this problem?
I found it very weird that there is no way to advance the clock from t0 to t1 while catching all the callLater calls within t0 and t1.
Based on what you wrote later in your question, I'm going to suppose that the case you're pointing out is the one demonstrated by the following example program:
from twisted.internet.task import Clock

def foo(reactor, n):
    if n == 0:
        print "Done!"
    reactor.callLater(1, foo, reactor, n - 1)

reactor = Clock()
foo(reactor, 10)
reactor.advance(10)
One might expect this program to print Done! but it does not. If the last line is replaced with:
for i in range(10):
    reactor.advance(1)
Then the resulting program does print Done!.
The reason Clock works this way is that it's exactly the way real clocks work. As far as I know, there are no computer clocks that operate with a continuous time system. I won't say it is impossible to implement a timed-event system on top of a clock with discrete steps such that it appears to offer a continuous flow of time - but I will say that Twisted makes no attempt to do so.
The only real difference between Clock and the real reactor implementations is that with Clock you can make the time-steps much larger than you are likely to encounter in typical usage of a real reactor.
However, it's quite possible for a real reactor to get into a situation where a very large chunk of time all passes in one discrete step. This could be because the system clock changes (there's some discussion of making it possible to schedule events independent of the system clock so that this case goes away) or it could be because some application code blocked the reactor for a while (actually, application code always blocks the reactor! But in typical programs it only blocks it for a period of time short enough for most people to ignore).
Giving Clock a way to mimic these large steps makes it possible to write tests for what your program does when one of these cases arises. For example, perhaps you really care that, when the kernel decides not to schedule your program for 2.1 seconds because of a weird quirk in the Linux I/O elevator algorithm, your physics engine nevertheless computes 2.1 seconds of physics even though 420 calls of your 200Hz simulation loop have been skipped.
It might be fair to argue that the default (standard? only?) time-based testing tool offered by Twisted should be somewhat more friendly towards the common case... Or not. Maybe that would encourage people to write programs that only work in the common case and break in the real world when the uncommon (but, ultimately, inevitable) case arises. I'm not sure.
Regarding Mike's suggestion to advance exactly to the next scheduled call, you can do this easily and without hacking any internals. clock.advance(clock.getDelayedCalls()[0].getTime() - clock.seconds()) will do exactly this (perhaps you could argue Clock would be better if it at least offered an obvious helper function for this to ease testing of the common case). Just remember that real clocks do not advance like this so if your code has a certain desirable behavior in your unit tests when you use this trick, don't be fooled into thinking this means that same desirable behavior will exist in real usage.
Given that the typical use-case for Twisted is to mix hardware events and timers, I'm confused why you would want to do this, but...
My understanding is that interally Twisted is tracking callLater events via a number of lists that are inside of the reactor object (See: http://twistedmatrix.com/trac/browser/tags/releases/twisted-15.2.0/twisted/internet/base.py#L437 - the xxxTimedCalls lists inside of class ReactorBase)
I haven't done any work to figure out if those lists are exposed anywhere, but if you want to take the reactors life into your own hands I'm sure you could hack your way in.
With access to the timing lists you could simply forward time to whenever the next element of the list is... though if you're trying to test code that interacts with IO events, I can't imagine this is going to do anything but confuse you...
Best of luck
Here's a function that will advance the reactor to the next IDelayedCall by iterating over reactor.getDelayedCalls. This has the problem Mike mentioned of not catching IO events, so you can specify a minimum and maximum time that it should wait, as well as a maximum time step.
def advance_through_delayeds(reactor, min_t=None, max_t=None, max_step=None):
    elapsed = 0
    while True:
        if max_t is not None and elapsed >= max_t:
            break
        try:
            step = min(d.getTime() - reactor.seconds() for d in reactor.getDelayedCalls())
        except ValueError:
            # nothing else pending
            if min_t is not None and elapsed < min_t:
                step = min_t - elapsed
            else:
                break
        if max_step is not None:
            step = min(step, max_step)
        if max_t is not None:
            step = min(step, max_t - elapsed)
        reactor.advance(step)
        elapsed += step
    return elapsed
If you need to wait for some I/O to complete, then set min_t and max_step to reasonable values.
# wait at least 10s, advancing the reactor by no more than 0.1s at a time
advance_through_delayeds(reactor, min_t=10, max_step=0.1)
If min_t is set, it will exit once getDelayedCalls returns an empty list after that time is reached.
It's probably a good idea to always set max_t to a sane value to prevent the test suite from hanging. For example, on the foo function above by JPC, it does reach the print "Done!" statement, but it would then hang forever, as the callback chain never completes.

How to add random delays between the queries sent to Google to avoid getting blocked in Python

I have written a program which sends more than 15 queries to Google in each iteration, with about 50 iterations in total. For testing, I have to run this program several times; after doing that repeatedly, Google blocks me. Is there any way I can fool Google, maybe by adding delays between each iteration? I have also heard that Google can learn the time steps, so I need the delays to be random so Google can't find a pattern and learn my behavior. They should also be short, so the whole process doesn't take too long.
Does anyone knows something, or can provide me a piece of code in python?
Thanks
First, Google is probably blocking you because they don't like it when you take too many of their resources. The best way to fix this is to slow the requests down, not to delay them randomly. Stick a 1-second wait after every request and you'll probably stop having problems.
That said:
from random import randint
from time import sleep
sleep(randint(10,100))
will sleep a random number of seconds (between 10 and 100).
Best to use:
from numpy import random
from time import sleep
sleeptime = random.uniform(2, 4)
print("sleeping for:", sleeptime, "seconds")
sleep(sleeptime)
print("sleeping is over")
as a start, and slowly decrease the range to see what works best (fastest).
Since you're not testing Google's speed, figure out some way to simulate it when doing your testing (as #bstpierre suggested in his comment). This should solve your problem and factor its variable response times out at the same time.
You can also try using a few proxy servers to prevent a ban by IP address. urllib supports proxies via a special constructor parameter, and httplib can use a proxy too.
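In modern Python (urllib.request) that looks roughly like this; the proxy address is a made-up placeholder, and the actual request is left commented out:

```python
import urllib.request

# hypothetical proxy address; replace with a real proxy you control
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib.request.build_opener(proxy)

# requests made through this opener would be routed via the proxy:
# opener.open("http://example.com/search")
```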
For anyone stumbling here for the general "how to add random delay to my routine" case in 2022, numpy's recommended method [1] is to use their random number generator class:
from numpy.random import default_rng
from time import sleep
rng = default_rng()
# generates a scalar [single] value greater than or equal to 1
# but less than 3
time_to_sleep = rng.uniform(1, 3)
sleep(time_to_sleep)
[1] https://numpy.org/doc/stable/reference/random/index.html#quick-start
