I am writing a 3D engine in Python using Pygame. For the first 300 or so frames, it takes from about 0.004 to 0.006 seconds to render 140 polygons. But after that, it suddenly takes an average of about 0.020 seconds to do the same task. This is concerning to me because this is a small-scale test, and even though 50 FPS is decent, it cannot be sustained at 1000 polygons, for instance.
I have already done a lot of streamlining to my code. I have also done some slightly deeper profiling, and it appears that the increased time is more or less proportionately distributed, which suggests that the problem is not specific to a single piece of code.
I assume that the problem has something to do with memory usage, but I do not know exactly why it is happening. What specific issue is causing this slowdown, how can I optimize my code to fix it, and what more general practices should I follow? Since my code is very long, it is posted here.
Although I can't answer your question exactly, I would use a task manager and watch the "python" (or "pygame", depending on your OS) process and view its memory consumption. If that turns out to be the issue, you could check which variables you no longer need after a certain point and clear them.
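If you'd rather measure from inside the program than from the task manager, a minimal sketch using the standard library's tracemalloc module could look like this (render_frame is a placeholder for your own per-frame code; note that tracemalloc only sees allocations made through Python's allocator, so memory held by C extensions such as Pygame surfaces may not show up):

import tracemalloc

tracemalloc.start()

for frame in range(1000):
    render_frame()   # placeholder for your per-frame rendering work
    if frame % 100 == 0:
        current, peak = tracemalloc.get_traced_memory()
        print(f'frame {frame}: current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB')

tracemalloc.stop()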
Edit: Some CPUs have built-in protection mechanisms that throttle processes. What I mean is this:
If Application X takes up around 40% of the CPU (it doesn't have to be that high) for a certain amount of time, the CPU will throttle the amount of CPU that Application X is allowed to use. This can cause slowdowns like the one you describe. It doesn't happen with (most) games because they're set up to tell the CPU to expect that amount of strain.
I have a Python program that runs on AWS Lambda.
It sometimes works and sometimes doesn't (it fails roughly one time in 20).
There is no error message and no signal.
So I guess this is because of memory... maybe.
For testing purposes, I want to write a program that wastes memory on purpose.
I searched around but can't find a good sample or explanation; almost all samples are about how not to waste memory.
Is there any good way to do this?
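One straightforward way is to keep appending large buffers to a list so that nothing can be garbage collected. A minimal sketch (the chunk size and target are arbitrary and would need tuning to the Lambda memory limit you want to hit):

# Deliberately hold on to more and more memory until a target is reached.
hog = []
chunk_mb = 10
target_mb = 500   # arbitrary target; tune to your Lambda's memory setting

for i in range(target_mb // chunk_mb):
    # each bytearray is kept alive by the list, so memory usage keeps growing
    hog.append(bytearray(chunk_mb * 1024 * 1024))
    print(f'allocated roughly {(i + 1) * chunk_mb} MB')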
With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools inside it? Say I have something like:
def task(path):
    # i/o bound
    image = load(path)
    # cpu bound, but only takes about 1/10 of the time of the i/o bound parts
    image = preprocess(image)
    # i/o bound
    save(image, path)
Then I'd want to process a list of paths path_list. If I use ThreadPool I still end up hitting a ceiling because of the cpu bound bit. If I use a Pool I spend too much dead time waiting for i/o. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?
A shorter way of restating my example: what if I have a method that should be multithreaded because it's i/o bound, but I also want to make use of many CPU cores? If I use a Pool, I use up each core for a single task that is i/o bound. If I use a ThreadPool, I only get to use one core.
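For concreteness, the layout I have in mind is roughly the following sketch (load, preprocess and save are the placeholders from above, and the pool sizes are made up):

import multiprocessing
from multiprocessing.pool import ThreadPool

def task(path):
    image = load(path)           # i/o bound
    image = preprocess(image)    # cpu bound
    save(image, path)            # i/o bound

def worker(paths):
    # each process runs its own thread pool over its share of the paths
    with ThreadPool(8) as tp:
        tp.map(task, paths)

if __name__ == '__main__':
    chunks = [path_list[i::4] for i in range(4)]   # split path_list across 4 processes
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, chunks)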
Does it make sense
Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than 100% CPU, so we start adding threads. As long as we see an increase in task throughput, it means the CPU is our bottleneck. At some point, we might hit 100% CPU utilization in our process. Because of the GIL, a pure Python process can utilize at most 100% CPU (i.e. one core's worth). But, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).
In summary, this is a valid approach for increasing the throughput of pure-Python tasks that both utilize the CPU and block on IO. But it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the winner on other resource-utilization metrics such as memory usage.
For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, if your goal is to achieve maximal performance and minimal memory utilization, then you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance out of your hardware is not an easy task. Below is a methodology that I feel has served me well.
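To illustrate that last option, a rough sketch of "an IO loop plus a process pool executor for the CPU-bound step" could look like the following (load, preprocess and save are your placeholders; asyncio.to_thread requires Python 3.9+, and preprocess and its image argument must be picklable):

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def handle(path, pool):
    loop = asyncio.get_running_loop()
    # i/o bound steps run in threads so they don't block the event loop
    image = await asyncio.to_thread(load, path)
    # the cpu bound step runs in a separate process
    image = await loop.run_in_executor(pool, preprocess, image)
    await asyncio.to_thread(save, image, path)

async def main(path_list):
    # a small pool is enough here; tune it with the estimate below
    # (for a very long path_list you would also bound concurrency, e.g. with a semaphore)
    with ProcessPoolExecutor(max_workers=2) as pool:
        await asyncio.gather(*(handle(p, pool) for p in path_list))

# asyncio.run(main(path_list))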
Aiming towards maximal performance
From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little better. To reach your performance goals, I recommend that you:
Understand your hardware utilization
Identify the bottleneck and estimate the maximal throughput
Design a solution to achieve that throughput
Implement the design, and optimize until you meet your requirements
In detail:
1. Understand your hardware utilization
In this case, there are a few pieces of hardware involved:
The RAM
The disk
The CPU
Let's look at one "task" and note how it uses the hardware:
Disk (read)
RAM (write)
CPU time
RAM (read)
Disk (write)
2. Identify the bottleneck and estimate the maximal throughput
To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using python:
(Note that I'm using made-up constants; you'll have to fill in the real numbers for your setup in order to use it.)
# ----------- General consts
input_image_size = 20 * 2 ** 20 # 20MB
output_image_size = 15 * 2 ** 20 # 15MB
# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30 # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30 # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size
# ----------- RAM
ram_bandwidth = 30 * 2 ** 30 # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size
ram_write_throughput = userspace_ram_bandwidth / output_image_size
# ----------- CPU
# We decrease one core, as at least some scheduling code and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate
# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
                                       (disk_write_throughput, 'Disk write'),
                                       (ram_read_throughput, 'RAM read'),
                                       (ram_write_throughput, 'RAM write'),
                                       (cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
      f'Bottleneck: {bottleneck_name}\n'
      f'Worker amount: {cpu_cores_needed:.1f}')
This code outputs:
Throughput: 341.3 tasks per second
Bottleneck: Disk write
Worker amount: 0.3
That means:
The maximum rate we can achieve is around 341.3 tasks per second
The disk is the bottleneck. You might be able to increase your performance by, for example:
Buying more disks
Using ramfs or a similar solution that avoids using the disk altogether
In a system where all the steps in the task are executed in parallel, you won't need to dedicate more than one core to running preprocess. (In Python that means you'll probably need only one process, and threads or asyncio will suffice to achieve concurrency with the other steps.)
Note: the numbers are lying
This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to get good measurements for the constants. For example, there is a big issue with the current calculation: reads and writes are not orthogonal. We assume in our calculation that everything happens in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously with the reads. The RAM rates should probably be decreased by at least 50%.
3. Design a solution to achieve that throughput
Similar to what you proposed in your question, my initial design would be something like:
Have a pool of workers load the images and send them on a queue to the next step (we'll need to read with multiple cores to use all the available memory bandwidth)
Have a pool of workers process the images and send the results on a queue (the number of workers should be chosen according to the output of the script above; for the current result, it is 1)
Have a pool of workers save the processed images to the disk.
The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.
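A bare-bones version of that design, wired together with multiprocessing queues, could look like the sketch below (worker counts are illustrative only, and load, preprocess and save are the placeholders from the question):

import multiprocessing as mp

def loader(path_q, image_q):
    for path in iter(path_q.get, None):          # None is the shutdown sentinel
        image_q.put((path, load(path)))

def processor(image_q, result_q):
    for path, image in iter(image_q.get, None):
        result_q.put((path, preprocess(image)))

def saver(result_q):
    for path, image in iter(result_q.get, None):
        save(image, path)

if __name__ == '__main__':
    path_q = mp.Queue()
    image_q = mp.Queue(maxsize=16)               # bounded queues give backpressure
    result_q = mp.Queue(maxsize=16)
    loaders = [mp.Process(target=loader, args=(path_q, image_q)) for _ in range(2)]
    processors = [mp.Process(target=processor, args=(image_q, result_q))]   # 1 per the estimate
    savers = [mp.Process(target=saver, args=(result_q,)) for _ in range(2)]
    for p in loaders + processors + savers:
        p.start()
    for path in path_list:
        path_q.put(path)
    # shut each stage down once the previous one has drained
    for _ in loaders:
        path_q.put(None)
    for p in loaders:
        p.join()
    for _ in processors:
        image_q.put(None)
    for p in processors:
        p.join()
    for _ in savers:
        result_q.put(None)
    for p in savers:
        p.join()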
4. Implement the design, and optimize until you meet your requirements
Good luck, and be warned that even if you did a good job at estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements might give you a good idea of the amount of effort needed. For example, if the rate you need is 10x slower than the maximum rate, you might be done pretty quickly. But if it is only 2x slower, you might want to consider doubling your hardware and start preparing for some hard work :)
kmarok's answer is a good technical one. But I would also keep in mind the saying that "premature optimization is the root of all evil".
In short: yes, it makes sense. But do you really need to?
Optimization is a trade off. You compromise code simplicity for better performance. Code simplicity is important; you'll need to further develop, debug, and test your software in the future. This will cost you in time. Simplicity buys you time. You need to be aware of the trade-off when you optimize.
I would first write a multithreaded version and measure it using your hardware.
Then I would try the multiprocessing version, and measure it too.
Is either of the versions good enough? It might be. If so, you've just made your software simpler, more readable, and more maintainable.
Chen's and Kamaork's answers cover most of what you need to know, but there are two missing ideas:
Your code will be a process and not the process; this means you need to account for how many resources are left for you, not how many the machine has in total (this can even happen within your own process, since threads are not unlimited). This deadly problem happened to me once, leaving me with less than half of a Celeron for a GUI, which was not good.
The biggest optimization you can do with threads is "prediction" (more specifically, knowing when things will happen): you can chain threads together much better when you know how long a computation takes and the wait is consistent. Reading about the TCP window may give you a better idea of how a delay can be optimized by expecting it rather than forcing it.
This is a rather general question:
I am having issues that the same operation measured by time.clock() takes longer now than it used to.
While I had some very similar measurements
1954 s
1948 s
1948 s
One somewhat different measurement
1999 s
Another was even more different
2207 s
It still seemed more or less OK, but for another one I got
2782 s
And now that I am repeating the measurements, it seems to get slower and slower.
I am not summing over the measurements after rounding or doing other weird manipulations.
Do you have any ideas about whether this could be affected by how busy the server is, by the clock speed, or by any other variable parameters? I was hoping that using time.clock() instead of time.time() would mostly rule these out...
The OS is Ubuntu 18.04.1 LTS.
The operations are run in separate screen sessions.
The operations do not involve hard-disk access.
The operations are mostly numpy operations that are not distributed. So this is actually mainly C code being executed.
EDIT: This might be relevant: The measurements in time.time() and time.clock() are very similar in any of the cases. That is time.time() measurements are always just slightly longer than time.clock(). So if I haven't missed something, the cause has almost exactly the same effect on time.clock() as on time.time().
EDIT: I do not think that my question has been answered. Another reason I can think of is that garbage collection contributes to CPU usage and is done more frequently when the RAM is full or about to become full.
Mainly, I am looking for an alternative measure that gives the same number for the same operations, meaning my algorithm executed from the same start state. Is there a simple way to count FLOPs or something similar?
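For reference, a minimal sketch of timing the same run with time.process_time() (CPU time consumed by this process only) next to time.perf_counter() (wall clock); these are the explicit replacements for the deprecated time.clock(), and run_algorithm stands in for the actual code:

import time

def run_algorithm():
    # placeholder for the real numpy-heavy computation
    return sum(i * i for i in range(10**6))

cpu_start, wall_start = time.process_time(), time.perf_counter()
run_algorithm()
cpu_time = time.process_time() - cpu_start
wall_time = time.perf_counter() - wall_start
print(f'CPU time: {cpu_time:.3f} s, wall time: {wall_time:.3f} s')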
The issue seems to be related to Python and Ubuntu.
Try the following:
Check that you have the latest stable build of the Python version you're using (Link 1)
Check the process list, and see which CPU core your Python executable is running on.
Check the thread priority state on the CPU (Link 2, Link 3)
Note:
Time may vary due to process switching, threading, and other OS resource management, as well as application code execution (this cannot be controlled)
Suggestions:
It could be because of your system's build; try running your code on another machine or in a virtual machine.
Read Up:
Link 4
Link 5
Good Luck.
~ Dean Van Geunen
As a result of repeatedly running the same algorithm at different 'system states', I would summarize that the answer to the question is:
Yes, time.clock() can be heavily affected by the state of the system.
Of course, this holds all the more for time.time().
The general reasons could be that
The same Python code does not always result in the same commands being sent to the CPU; that is, the commands depend not only on the code and the start state, but also on the system state (e.g. garbage collection)
The system might interfere with the commands sent from Python, resulting in additional CPU usage (e.g. from core switching) that is still counted by time.clock()
The divergence can be very large, in my case around 50%.
It is not clear which are the specific reasons, nor how much each of them contributes to the problem.
It is still to be tested whether timeit helps with some or all of the above points. However, timeit is meant for benchmarking and might not be recommended for use during normal processing. It turns off garbage collection by default and does not give you access to the return values of the timed function.
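For completeness, a minimal sketch of what such a timeit measurement could look like; run_algorithm is a stand-in for the real code, and the second call shows how to leave garbage collection enabled during the timed runs:

import timeit

def run_algorithm():
    # placeholder for the real computation
    return sum(i * i for i in range(10**6))

# By default timeit disables garbage collection while timing.
durations = timeit.repeat('run_algorithm()', globals=globals(), repeat=5, number=1)

# Same measurement, but with the GC re-enabled for the timed runs.
durations_gc = timeit.repeat('run_algorithm()',
                             setup='import gc; gc.enable()',
                             globals=globals(), repeat=5, number=1)

print(min(durations), min(durations_gc))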
For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns, at around 1.5GB in CSV format. It's a mixture of ints, bools, and floats of up to 9 digits.
During this period, each time I've been able to get the data to cluster it has taken 3-plus days, which seems weird given that HDBSCAN is revered for its speed and I'm running this on a Google Cloud compute instance with 96 CPUs. I've spent days trying to get it to utilize the cloud instance's processing power, but to no avail.
Using the auto algorithm detection in HDBSCAN, it selects boruvka_kdtree as the best algorithm to use. And I've tried passing all sorts of values to the core_dist_n_jobs parameter: -2, -1, 1, 96, multiprocessing.cpu_count(), even 300. All seem to have a similar effect: four main Python processes each utilize a full core while way more sleeping processes are spawned.
I refuse to believe I'm doing this right and that this is truly how long it takes on this hardware. I'm convinced I must be missing something, like JupyterHub on the same machine causing some sort of GIL lock, or some parameter to HDBSCAN.
Here is my current call to HDBSCAN:
hdbscan.HDBSCAN(min_cluster_size = 100000, min_samples = 500, algorithm='best', alpha=1.0, memory=mem, core_dist_n_jobs = multiprocessing.cpu_count())
I've followed all the existing issues and posts related to this problem that I could find and nothing has worked so far, but I'm always willing to try even radical ideas, because this isn't even the full data I want to cluster, and at this rate it would take 4 years to cluster the full dataset!
According to the author
Only the core distance computation can use all the cores, sadly that is apparently the first few seconds. The rest of the computation is quite challenging to parallelise unfortunately and will run on a single thread.
you can read the issues from the links below:
Not using all available CPUs?
core_dist_n_jobs =1 or -1 -> no difference at all and computation time extremely high
I've completed writing a multiclass classification algorithm that uses boosted classifiers. One of the main calculations consists of weighted least squares regression.
The main libraries I've used include:
statsmodels (for regression)
numpy (pretty much everywhere)
scikit-image (for extracting HoG features of images)
I've developed the algorithm in Python, using Anaconda's Spyder.
I now need to use the algorithm to start training classification models. So I'll be passing approximately 7000-10000 images to this algorithm, each about 50x100, all in grayscale.
Now I've been told that a powerful machine is available to speed up the training process. They asked me, "Are you using the GPU?", and a few other questions.
To be honest, I have no experience with CUDA/GPUs, etc. I've only ever heard of them. I didn't develop my code with any such thing in mind. In fact, I had the (ignorant) impression that a good machine would automatically run my code faster than a mediocre one, without my having to do anything about it (apart from obviously writing regular code efficiently in terms of loops, O(n), etc.).
Is it still possible for my code to be sped up simply by virtue of running on a high-performance computer? Or do I need to modify it to make use of a parallel-processing machine?
The comments and Moj's answer give a lot of good advice. I have some experience in signal/image processing with Python, have banged my head against the performance wall repeatedly, and I just want to share a few thoughts about making things faster in general. Maybe these help in figuring out possible solutions for slow algorithms.
Where is the time spent?
Let us assume that you have a great algorithm which is just too slow. The first step is to profile it to see where the time is spent. Sometimes the time is spent doing trivial things in a stupid way. It may be in your own code, or it may even be in the library code. For example, if you want to run a 2D Gaussian filter with a largish kernel, direct convolution is very slow, and even FFT may be slow. Approximating the filter with computationally cheap successive sliding averages may speed things up by a factor of 10 or 100 in some cases and give results which are close enough.
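To illustrate that particular example, a large Gaussian blur can be approximated by a few passes of a cheap box (uniform) filter; a rough sketch with scipy, where the kernel sizes are ballpark figures rather than a tuned equivalence:

import numpy as np
from scipy import ndimage

image = np.random.rand(2048, 2048).astype(np.float32)

# direct Gaussian filter with a largish kernel
blurred_gauss = ndimage.gaussian_filter(image, sigma=10)

# rough approximation: three passes of a box filter
# (n box filters of width w approximate a Gaussian with
#  sigma ~ sqrt(n * (w**2 - 1) / 12), so w ~ 20 for sigma ~ 10, n = 3)
blurred_box = image
for _ in range(3):
    blurred_box = ndimage.uniform_filter(blurred_box, size=20)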
If a lot of time is spent in some module/library code, you should check if the algorithm is just a slow algorithm, or if there is something slow with the library. Python is a great programming language, but for pure number crunching operations it is not good, which means most great libraries have some binary libraries doing the heavy lifting. On the other hand, if you can find suitable libraries, the penalty for using python in signal/image processing is often negligible. Thus, rewriting the whole program in C does not usually help much.
Writing a good algorithm even in C is not always trivial, and sometimes the performance may vary a lot depending on things like CPU cache. If the data is in the CPU cache, it can be fetched very fast, if it is not, then the algorithm is much slower. This may introduce non-linear steps into the processing time depending on the data size. (Most people know this from the virtual memory swapping, where it is more visible.) Due to this it may be faster to solve 100 problems with 100 000 points than 1 problem with 10 000 000 points.
One thing to check is the precision used in the calculation. In some cases float32 is as good as float64 but much faster. In many cases there is no difference.
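For example, the switch can be as simple as the following; whether the reduced precision is acceptable depends entirely on the algorithm:

import numpy as np

a = np.random.rand(2048, 2048)             # float64 by default
b = a.astype(np.float32)                   # half the memory, often faster

# the same operation in both precisions
r64 = a @ a.T
r32 = b @ b.T

print(a.nbytes / 2**20, b.nbytes / 2**20)  # 32.0 MiB vs 16.0 MiB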
Multi-threading
Python - did I mention? - is a great programming language, but one of its shortcomings is that in its basic form it runs a single thread. So, no matter how many cores you have in your system, the wall clock time is always the same. The result is that one of the cores is at 100 %, and the others spend their time idling. Making things parallel and having multiple threads may improve your performance by a factor of, e.g., 3 in a 4-core machine.
It is usually a very good idea if you can split your problem into small independent parts. It helps with many performance bottlenecks.
And do not expect technology to come to rescue. If the code is not written to be parallel, it is very difficult for a machine to make it parallel.
GPUs
Your machine may have a great GPU with maybe 1536 number-hungry cores ready to crunch everything you toss at them. The bad news is that making GPU code is a bit different from writing CPU code. There are some slightly generic APIs around (CUDA, OpenCL), but if you are not accustomed to writing parallel code for GPUs, prepare for a steepish learning curve. On the other hand, it is likely someone has already written the library you need, and then you only need to hook to that.
With GPUs the sheer number-crunching power is impressive, almost frightening. We may talk about 3 TFLOPS (3 x 10^12 single-precision floating-point ops per second). The problem there is how to get the data to the GPU cores, because the memory bandwidth will become the limiting factor. This means that even though using GPUs is a good idea in many cases, there are a lot of cases where there is no gain.
Typically, if you are performing a lot of local operations on the image, the operations are easy to make parallel, and they fit well a GPU. If you are doing global operations, the situation is a bit more complicated. A FFT requires information from all over the image, and thus the standard algorithm does not work well with GPUs. (There are GPU-based algorithms for FFTs, and they sometimes make things much faster.)
Also, beware that making your algorithms run on a GPU binds you to that GPU. The portability of your code across OSes or machines suffers.
Buy some performance
Also, one important thing to consider is whether you need to run your algorithm once, once in a while, or in real time. Sometimes the solution is as easy as buying time on a larger computer. For a dollar or two an hour you can buy time on quite fast machines with a lot of resources. It is simpler and often cheaper than you would think. GPU capacity can also be bought easily for a similar price.
One possibly slightly under-advertised property of some cloud services is that in some cases the IO speed of the virtual machines is extremely good compared to physical machines. The difference comes from the fact that there are no spinning platters with the average penalty of half-revolution per data seek. This may be important with data-intensive applications, especially if you work with a large number of files and access them in a non-linear way.
I am afraid you cannot speed up your program just by running it on a powerful computer. I had this issue a while back. I first used Python (very slow), then moved to C (slow), and then had to use other tricks and techniques. For example, it is sometimes possible to apply some dimensionality reduction to speed things up while still getting reasonably accurate results, or, as you mentioned, to use multiprocessing techniques.
Since you are dealing with an image processing problem, you do a lot of matrix operations, and a GPU would surely be a great help. There are some nice and active CUDA wrappers for Python that you can easily use without knowing much CUDA. I tried Theano, PyCUDA and scikit-cuda (there should be more by now).
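As a tiny taste of the wrapper route, a PyCUDA gpuarray sketch might look like this (assuming PyCUDA is installed and a CUDA-capable GPU is available; for real workloads you would more likely hook into an existing GPU library than write kernels yourself):

import numpy as np
import pycuda.autoinit                  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray

a = np.random.randn(1024, 1024).astype(np.float32)

a_gpu = gpuarray.to_gpu(a)              # host -> device copy
b_gpu = 2 * a_gpu + 1                   # elementwise math runs on the GPU
b = b_gpu.get()                         # device -> host copy

print(np.allclose(b, 2 * a + 1))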