Problems with loading RAM to 100% - python

I am running an experiment to load RAM to 100% on macOS. I stumbled upon the method described here: https://unix.stackexchange.com/a/99365
I decided to do the same and wrote the two programs presented below. While the first program runs, the system reports that the process occupies 120 GB, yet the memory usage graph stays stable. With the second program, a warning that the system does not have enough resources pops up almost immediately. The second program creates ten parallel processes, each of which increases memory consumption in approximately the same way.
First program:
import time

def load_ram(vm, timer):
    # Allocate a tuple of vm * 8 * 1024**3 zeros, then idle for `timer` seconds
    x = (vm * 1024 * 1024 * 1024 * 8) * (0,)
    begin_time = time.time()
    while time.time() - begin_time < timer:
        pass
    print("end")
[Screenshot: memory occupied by the first program]
Second program:
import os

def load_ram(vm, timer):
    # Write a bash script that starts ten Python processes, each allocating the
    # same tuple of zeros and then sleeping for `timer` seconds
    file_sh = open("bash_file.sh", "w")
    str_to_bash = """
VM=%d;
for i in {1..10};
do
    python -c "x=($VM*1024*1024*1024*8)*(0,); import time; time.sleep(%d)" & echo "started" $i ;
done""" % (int(vm), int(timer))
    file_sh.write(str_to_bash)
    file_sh.close()
    os.system("bash bash_file.sh")
[Screenshot: memory occupied by the second program]
[Screenshot: memory occupied by the second program, plus the system warning]
Parameters: vm = 16, timer = 30.
With the first program, the reported memory usage reaches about 128 gigabytes (after that, a kill message appears in the terminal and the process stops). The second takes up more than 160 gigabytes, as shown in the picture, and none of the ten processes completes. The warning that the system is low on resources appears even when each process takes only 10 gigabytes (that is, 100 gigabytes in total).
According to the situation described, two questions arise:
Why, with the same memory consumption (120 gigabytes), does the system act as if the process does not exist in the first case, yet immediately buckle under the same load in the second?
Where does the figure of 120 gigabytes come from if my computer has only 16 gigabytes of RAM?
Thank you for your attention!

Related

Why does per-process overhead constantly increase for multiprocessing?

On a 6-core CPU with 12 logical CPUs, I was counting up to really high numbers in a for-loop, several times.
To speed things up I was using multiprocessing. I was expecting something like:
number of processes <= number of CPUs: roughly the same time
number of processes = number of CPUs + 1: time doubled
What I actually found was a continuous increase in time. I'm confused.
The code was:
#!/usr/bin/python
from multiprocessing import Process, Queue
import random
from timeit import default_timer as timer

def rand_val():
    num = []
    for i in range(200000000):
        num = random.random()
    print('done')

def main():
    for iii in range(15):
        processes = [Process(target=rand_val) for _ in range(iii)]
        start = timer()
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        end = timer()
        print(f'elapsed time: {end - start}')
        print('for ' + str(iii))
        print('')

if __name__ == "__main__":
    main()
    print('done')
result:
elapsed time: 14.9477102 for 1
elapsed time: 15.4961154 for 2
elapsed time: 16.9633134 for 3
elapsed time: 18.723183399999996 for 4
elapsed time: 21.568377299999995 for 5
elapsed time: 24.126758499999994 for 6
elapsed time: 29.142095499999996 for 7
elapsed time: 33.175509300000016 for 8
.
.
.
elapsed time: 44.629786800000005 for 11
elapsed time: 46.22480710000002 for 12
elapsed time: 50.44349420000003 for 13
elapsed time: 54.61919949999998 for 14
There are two wrong assumptions you make:
Processes are not free. Merely adding processes adds overhead to the program.
Processes do not own CPUs. A CPU interleaves execution of several processes.
The first point is why you see some overhead even though there are fewer processes than CPUs. Note that your system usually has several background processes running, so "fewer processes than CPUs" is not clear-cut for a single application.
The second point is why you see the execution time increase gradually when there are more processes than CPUs. Any OS running mainline Python does preemptive multitasking of processes; roughly, this means a process does not block a CPU until it is done, but is paused regularly so that other processes can run.
In effect, this means that several processes can share one CPU. Since the CPU can still only do a fixed amount of work per unit of time, all processes take longer to complete.
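For the first point, a minimal sketch like the one below (my own addition, not part of the original answer) times batches of no-op processes; the elapsed time grows with the process count even though no work is done, which is exactly the spawn-and-teardown overhead.
from multiprocessing import Process
from timeit import default_timer as timer

def noop():
    # No work at all: any measured time is pure process overhead.
    pass

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        procs = [Process(target=noop) for _ in range(n)]
        start = timer()
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()
        print(f'{n} no-op processes: {timer() - start:.3f} s')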
I don't understand what you are trying to achieve.
You are taking the same work and running it X times, where X is the number of workers in your loop. You should be taking the work and dividing it by X, then sending one chunk to each worker, as sketched below.
Anyway, with regard to what you are observing: you are seeing the time it takes to spawn and tear down the separate processes. Python isn't quick at starting new processes.
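A rough sketch of that division (my own illustration, not code from the answer), using multiprocessing.Pool so each worker only does its share of the 200,000,000 iterations:
from multiprocessing import Pool
import random

TOTAL = 200_000_000

def rand_val(count):
    # Each worker runs only its slice of the total iteration count.
    for _ in range(count):
        num = random.random()
    return 'done'

if __name__ == "__main__":
    workers = 6                  # e.g. one worker per physical core
    chunk = TOTAL // workers
    with Pool(workers) as pool:
        print(pool.map(rand_val, [chunk] * workers))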
Your test is faulty.
Imagine this: it takes one day for one farmer to work 10 km^2 of farmland using a single tractor. Why would you expect two farmers working twice the amount of farmland with two tractors to take less time?
You have 6 CPU cores; your village has 6 tractors, but nobody has money to buy a private tractor. As the number of workers (processes) in the village increases, the number of tractors remains the same, so everyone has to share the limited number of tractors.
In an ideal world, two farmers doing twice the work with two tractors would take exactly the same time as one farmer doing one portion of the work, but in real computers the machine has other work to do even when it seems idle. There is task switching, the OS kernel has to run and monitor hardware devices, memory caches need to be flushed and invalidated between CPU cores, your browser needs to run, the village elder is holding a meeting to decide who should get the tractors and when, and so on.
As the number of workers grows beyond the number of tractors, farmers don't just hog the tractors for themselves. Instead they make an arrangement to pass the tractors around every three hours or so, which means the seventh farmer doesn't have to wait two days for their share of tractor time. However, there is a cost to moving tractors between farmlands, just as there is a cost for a CPU to switch between processes: switch too frequently and the CPU does no actual work; switch too infrequently and you get resource starvation because some jobs take too long to start being worked on.
A more sensible test would be to keep the size of farmland constant and just increase the number of farmers. In your code, that would correspond to this change:
def rand_val(num_workers):
    num = []
    for i in range(200000000 // num_workers):   # same total work, split across the workers
        num = random.random()
    print('done')

def main():
    for iii in range(15):
        processes = [Process(target=rand_val, args=(iii,)) for _ in range(iii)]
        ...

How do I take advantage of parallel processing in gcloud?

Google Cloud offers a virtual machine instance with 96 cores.
I thought that with those 96 cores you could divide your program into 95 slices, leave one core to run the main program, and thus run the program 95 times faster.
It's not working out that way, however.
I'm running a simple Python program that just counts to 20 million.
On my MacBook it takes 4.6 seconds to run this program in serial, and 1.2 seconds when I divide the work into 4 sections and run it in parallel.
My MacBook has the following specs:
macOS 10.14.5
Processor: 2.2 GHz Intel Core i7 (Mid 2015)
Memory: 16 GB 1600 MHz DDR3
The machine that gcloud offers is basically just a tad slower.
When I run the program in serial it takes the same amount of time. When I divide the program into 95 divisions it actually takes more time: 2.6 seconds. When I divide it into 4, it takes 1.4 seconds to complete. That machine has the following specs:
n1-highmem-96 (96 vCPUs, 624 GB memory)
I need the high memory because I'm going to index a corpus of 14 billion words. I'm going to save the locations of the 100,000 most frequent words.
When I then save those locations, I'm going to pickle them.
Pickling files takes up an enormous amount of time and will probably consume 90% of the time.
It is for this reason that I need to pickle the files as rarely as possible. So if I can keep the locations in Python objects for as long as possible, I will save myself a lot of time.
Here is the Python program I'm using, just in case it helps:
import os, time

p = print
en = enumerate

def count_2_million(**kwargs):
    p('hey')
    start = kwargs.get("start")
    stop = kwargs.get("stop")
    fork_num = kwargs.get('fork_num')
    for x in range(start, stop):
        pass
    b = time.time()
    c = round(b - kwargs['time'], 3)
    p(f'done fork {fork_num} in {c} seconds')

def divide_range(divisions: int, total: int, idx: int, begin=0):
    # Split [begin, begin + total) into `divisions` slices and return slice `idx`.
    sec = total // divisions
    start = (idx * sec) + begin
    if total % divisions != 0 and idx == divisions - 1:  # last slice absorbs the remainder
        stop = total
    else:
        stop = start + sec
    return start, stop

def main_fork(func, total, **kwargs):
    forks = 4
    p(f'{forks} forks')
    kwargs['time'] = time.time()
    pids = []
    for i in range(forks):
        start1, stop1 = 0, 0
        if total != -1:
            start1, stop1 = divide_range(forks, total, i)
        newpid = os.fork()
        pids.append(newpid)
        kwargs['start'] = start1
        kwargs['stop'] = stop1
        kwargs['fork_num'] = i
        if newpid == 0:
            # Child process: run its slice and exit without returning to the loop.
            p(f'fork num {i} {start1} {stop1}')
            child(func, **kwargs)
    return pids

def child(func, **kwargs):
    func(**kwargs)
    os._exit(0)

main_fork(count_2_million, 200_000_000, **{})
In your specific use case, I think one of the solutions is to use clusters.
There are two main types of cluster computing workloads; in your case, since you need to index 14 billion words, you will need high-throughput computing (HTC).
What is high-throughput computing?
A type of computing where apps have multiple tasks that are processed independently of each other, without a need for the individual compute nodes to communicate. These workloads are sometimes called embarrassingly parallel or batch workloads.
When I run the program in serial it takes the same amount of time. When I divide the program into 95 division it actually takes more time: 2.6 seconds. When I divide the program into 4, it takes 1.4 seconds to complete.
For this part, you should check the section of the documentation on recommended architectures and best practices, so that you have the best setup to get the most out of the job you want done.
There are some open source software like ElastiCluster that provide cluster management and support for provisioning nodes while using Google Compute Engine.
After the cluster is operational you can use HTCondor to analyze and manage the cluster nodes.
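As a rough illustration of such an embarrassingly parallel split (my own sketch, not from the answer; the shard file names and the 95-worker count are assumptions), each worker indexes its own slice of the corpus independently, and only the small per-shard results are merged at the end:
from multiprocessing import Pool
from collections import Counter

def index_shard(path):
    # Hypothetical worker: count word occurrences in one corpus shard,
    # independently of all other workers.
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    shard_paths = [f"corpus_shard_{i:03d}.txt" for i in range(95)]  # assumed file layout
    totals = Counter()
    with Pool(95) as pool:
        for partial in pool.imap_unordered(index_shard, shard_paths):
            totals.update(partial)
    print(totals.most_common(10))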

Why does running this script freeze my computer?

I wrote a script in Python using SciPy to perform a short-time Fourier transform on a signal. When I ran it on a signal with a thousand timepoints, it ran fine. When I ran it on a signal with a million timepoints, it froze my computer (the computer doesn't respond, and if audio was playing, it outputs a skipping, looping buzz); this has happened consistently all three times I attempted it. I've written scripts that take hours to run, but I've never encountered one that actually froze my computer. Any idea why? The script is posted below:
import scipy as sp
from scipy import fftpack

def STFT(signal, seconds_per_sample, window_seconds, min_Hz):
    window_samples = int(window_seconds/seconds_per_sample) + 1
    signal_samples = len(signal)
    if signal_samples <= window_samples:
        length = max(signal_samples, int(1/(seconds_per_sample*min_Hz)) + 1)
        return sp.array([0]), fftpack.fftshift(fftpack.fftfreq(length, seconds_per_sample)), fftpack.fftshift(fftpack.fft(signal, n=length))
    else:
        length = max(window_samples, int(1/(seconds_per_sample*min_Hz)) + 1)
        frequency = fftpack.fftshift(fftpack.fftfreq(length, seconds_per_sample))
        time = []
        FTs = []
        for i in range(signal_samples - window_samples):
            time.append(seconds_per_sample*i)
            FTs.append(fftpack.fftshift(fftpack.fft(signal[i:i + window_samples], n=length)))
        return sp.array(time), frequency, sp.array(FTs)
The script consumes too much RAM when you run it over too large a number of points; see Why does a simple python script crash my system.
The process in which your program runs stores the arrays and variables for its calculations in process memory, which is RAM.
You can fix this by making the program keep its data on disk instead; see the sketch after the links below.
For workarounds (shelve, ...) see the following links:
memory usage, how to free memory
Python large variable RAM usage
I need to free up RAM by storing a Python dictionary on the hard drive, not in RAM. Is it possible?
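As a minimal sketch of that idea (my own illustration, not from the linked answers; the signal length and window size are made up), each window's FFT can be written to a shelve database on disk instead of being accumulated in an in-memory list:
import shelve
import numpy as np
from scipy import fftpack

signal = np.random.rand(100_000)     # placeholder signal
window_samples = 256                 # assumed window length

with shelve.open("stft_results.db") as db:
    for i in range(len(signal) - window_samples):
        ft = fftpack.fftshift(fftpack.fft(signal[i:i + window_samples]))
        db[str(i)] = ft              # each entry is pickled to disk, so RAM stays bounded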

Python list append timings

cat /proc/meminfo
MemTotal: 3981272 kB
I ran this simple test in python
#!/usr/bin/env python
import sys

num = int(sys.argv[1])
li = []
for i in xrange(num):
    li.append(i)
$ time ./listappend.py 1000000
real 0m0.342s
user 0m0.304s
sys 0m0.036s
$ time ./listappend.py 2000000
real 0m0.646s
user 0m0.556s
sys 0m0.084s
$ time ./listappend.py 4000000
real 0m1.254s
user 0m1.136s
sys 0m0.116s
$ time ./listappend.py 8000000
real 0m2.424s
user 0m2.176s
sys 0m0.236s
$ time ./listappend.py 16000000
real 0m4.832s
user 0m4.364s
sys 0m0.452s
$ time ./listappend.py 32000000
real 0m9.737s
user 0m8.637s
sys 0m1.028s
$ time ./listappend.py 64000000
real 0m56.296s
user 0m17.797s
sys 0m3.180s
Question:
The time for 64000000 is six times the time for 32000000, but before that the times were simply doubling. Why is that?
TL;DR - Due to RAM being insufficient & the memory being swapped out to secondary storage.
I ran the program with different sizes on my box. Here are the results
/usr/bin/time ./test.py 16000000
2.90user 0.26system 0:03.17elapsed 99%CPU 513480maxresident
0inputs+0outputs (0major+128715minor)pagefaults
/usr/bin/time ./test.py 32000000
6.10 user 0.49 system 0:06.64 elapsed 99%CPU 1022664maxresident
40inputs (2major+255998minor)pagefaults
/usr/bin/time ./test.py 64000000
12.70 user 0.98 system 0:14.09 elapsed 97%CPU 2040132maxresident
4272inputs (22major+510643minor)pagefaults
/usr/bin/time ./test.py 128000000
30.57 user 23.29 system 27:12.32 elapsed 3%CPU 3132276maxresident
19764880inputs (389184major+4129375minor)pagefaults
User time: the time the program spent running user code.
System time: the time the program spent executing on behalf of the system (i.e., time spent in system calls).
Elapsed time: the total wall-clock time the program took (includes waiting time).
Elapsed time = user time + system time + time spent waiting.
Major page fault: occurs when a page of memory isn't in RAM and has to be fetched from a secondary device such as a hard disk.
16M list size: the list fits mostly in memory, hence no major page faults.
32M list size: parts of the list have to be swapped out of memory, hence the small bump above an exact two-fold increase in elapsed time.
64M list size: the increase in elapsed time is more than two-fold because of the 22 major page faults.
128M list size: the elapsed time has increased from 14 seconds to over 27 minutes! The waiting time is almost 26 minutes, due to a huge number of page faults (389,184). Also notice that the CPU usage drops from 99% to 3% because of the massive waiting time.
As unutbu pointed out, with the allocation overhead for growing lists potentially pushing the behaviour towards O(n*n), the situation only gets worse.
According to effbot:
The time needed to append an item to the list is “amortized constant”;
whenever the list needs to allocate more memory, it allocates room for
a few items more than it actually needs, to avoid having to reallocate
on each call (this assumes that the memory allocator is fast; for huge
lists, the allocation overhead may push the behaviour towards O(n*n)).
(my emphasis).
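That over-allocation can be observed directly with sys.getsizeof (a small illustration of my own; the exact sizes vary by Python version and platform): the list's allocated size jumps only occasionally, because extra slots are reserved in advance.
import sys

li = []
last_size = sys.getsizeof(li)
print(f'len={len(li):>3}  size={last_size} bytes')
for i in range(64):
    li.append(i)
    size = sys.getsizeof(li)
    if size != last_size:
        # The size changes only at certain lengths: room was reserved for future appends.
        print(f'len={len(li):>3}  size={size} bytes')
        last_size = size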
As you append more items to the list, the reallocator will try to reserve ever-larger amounts of memory. Once you've consumed all your physical memory (RAM) and your OS starts using swap space, the shuffling of data from disk to RAM or vice versa will make your program very slow.
I strongly suspect your Python process runs out of physical RAM available to it, and starts swapping to disk.
Re-run the last test while keeping an eye on its memory usage and/or the number of page faults.
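One way to do that from inside the script itself (my own sketch, Python 3 and Unix-only, using the standard-library resource module) is to print the resident set size and page-fault counts once the appends finish:
import resource
import sys

num = int(sys.argv[1])
li = []
for i in range(num):
    li.append(i)

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f'max resident set size: {usage.ru_maxrss}')   # kB on Linux, bytes on macOS
print(f'major page faults:     {usage.ru_majflt}')
print(f'minor page faults:     {usage.ru_minflt}')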

Reducing system time usage of Python Subprocess

I have a Python script that uses multiprocessing's pool.map( ... ) to run a large number of calculations in parallel. Each calculation consists of the Python script setting up input for a Fortran program, using subprocess.Popen( ..., stdin=PIPE, stdout=PIPE, stderr=PIPE ) to run the program, feeding the input to it, and reading the output. The script then parses the output, extracts the needed numbers, and does it all again for the next run.
def main():
    #Read a configuration file
    #do initial setup
    pool = multiprocessing.Pool(processes=maxProc)
    runner = CalcRunner( things that are the same for each run )
    runNumsAndChis = pool.map( runner, xrange(startRunNum, endRunNum))
    #dump the data that makes it past a cut to disk

class CalcRunner(object):
    def __init__(self, stuff):
        #setup member variables
    def __call__(self, runNumber):
        #get parameters for this run
        params = self.getParams(runNumber)
        inFileLines = []
        #write the lines of the new input file to a list
        makeInputFile(inFileLines, ... )
        process = subprocess.Popen(cmdString, bufsize=0, stdin=subprocess.PIPE, ... )
        output = process.communicate( "".join(inFileLines) )
        #get the needed numbers from stdout
        chi2 = getChiSq(output[0])
        return [runNumber, chi2]
...
Anyways, on to the reason for the question. I submit this script to a grid engine system to break this huge parameter-space sweep into 1000 twelve-core tasks (I chose 12 because most of the grid machines have 12 cores). When a single task runs on a single 12-core machine, about 1/3 of the machine's time is spent doing system work and the other 2/3 goes to the user calculations, presumably setting up inputs to ECIS (the aforementioned Fortran code), running ECIS, and parsing its output. However, sometimes 5 tasks get sent to a 64-core machine to use 60 of its cores. On that machine, 40% of the time is spent on system work and only 1-2% on user work.
First of all, where are all the system calls coming from? I tried writing a version of the program that starts ECIS once per thread and keeps piping new input to it, and it spends FAR more time in system (and is slower overall), so it doesn't seem to be due to all the process creation and deletion.
Second of all, how do I go about decreasing the amount of time spent on system calls?
At a guess, the approach of opening a process once and repeatedly sending input to it was slower because I had to turn off gfortran's output buffering to get anything back from the process; nothing else worked (short of modifying the Fortran code... which isn't happening).
The OS on my home test machines where I developed this is Fedora 14. The OS on the grid machines is a recent version of Red Hat.
I have tried playing around with bufsize, setting it to -1 (system default), 0 (unbuffered), 1 (line buffered), and 64 kB; none of that seems to change things.
