use xargs and keep CPU load of remote server to minimum - python

I used xargs for parallel processing. This thread describes what I had: Parallel processing with xargs in bash.
But the parallel processing increased CPU load. I was running the script for 57 OpenStack tenants to fetch results from them. This report runs every 2 hours, which is causing a CPU spike.
To decrease the load, I thought of adding a random sleep time, something like below, but it didn't help much. I am not sure if I can use nice to set a priority, because the results I am fetching are from the OpenStack server. If it's possible, please let me know.
I could remove the parallel processing but this will take 6 hours for me to get all the reports from tenants. So that is not an option.
If there is any other way to optimize this...any ideas or suggestions would be great.
source ../creds/base
printf '%s\n' A B C D E F G H I J K L M N O P |
xargs -n 3 -P 8 bash -c 'for tenant; do
source ../creds/"$tenant"
python ../tools/openstack_resource_list.py "$tenant"> ./reports/openstack_reports/"$tenant".html
sleep $[ ( $RANDOM % 10 ) + 1 ]s
done' _
Python file
with open('../floating_list/testReports/'+tenant_file+'.csv', 'wb') as myfile:
fields = ['Name', 'Volume', 'Flavor', 'Image', 'Floating IP']
writer = csv.writer(myfile)
writer.writerow(fields)
try:
for i in servers:
import io, json
with open('../floating_list/testReports/'+tenant_file+'.json', 'w') as e:
e.write(check_output(['openstack', 'server', 'show', i.name, '-f', 'json']))
with open('../floating_list/testReports/'+tenant_file+'.json', 'r') as a:
data = json.load(a)
name.append(i.name)
volume = data.get('os-extended-volumes:volumes_attached', None)
if volume:
vol = [d.get('id', {}) for d in volume if volume]
vol_name_perm = []
for i in vol:
try:
vol_name1 = (check_output(['openstack', 'volume', 'show', i, '-c', 'name', '-f', 'value'])).rstrip()
vol_name_perm.append(vol_name1)
except:
vol_name_perm.append('Error')
vol_join = ','.join(vol_name_perm)
vol_name.append(vol_join)
else:
vol_name.append('None')
...
zipped = [(name), (vol_name),(flavor),(image),(addr)]
result = zip(*zipped)
for i in result:
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(i)
except (CalledProcessError, IndexError) as e:
print (e)
print("except", i.name)
WITH GNU PARALLEL
printf '%s\n' A B C D E F | parallel --eta -j 2 --load 40% --noswap 'for tenant; do
source ../creds/"$tenant"
python ../tools/openstack_resource_list.py "$tenant"> ./reports/openstack_reports/"$tenant".html
done'
I get syntax error near unexpected token `A'
WORKAROUND
I am able to manage the load with xargs -n 1 -P 3 for now. This gives me reports within 2 hours. I still want to explore my options with GNU Parallel, as suggested by Ole Tange.

Maybe you can use GNU Parallel with niceload (part of GNU Parallel):
niceload parallel --nice 11 --bar "'source ../creds/{};
python ../tools/openstack_resource_list.py {} > ./reports/openstack_reports/{}.html'" ::: {A..F}

Related

Python Script which can get instance(Server) level info... For ex. Memory, CPU etc [duplicate]

How can I get the current system status (current CPU, RAM, free disk space, etc.) in Python? Ideally, it would work for both Unix and Windows platforms.
There seems to be a few possible ways of extracting that from my search:
Using a library such as PSI (that currently seems not actively developed and not supported on multiple platforms) or something like pystatgrab (again no activity since 2007 it seems and no support for Windows).
Using platform specific code such as using a os.popen("ps") or similar for the *nix systems and MEMORYSTATUS in ctypes.windll.kernel32 (see this recipe on ActiveState) for the Windows platform. One could put a Python class together with all those code snippets.
It's not that those methods are bad but is there already a well-supported, multi-platform way of doing the same thing?
The psutil library gives you information about CPU, RAM, etc., on a variety of platforms:
psutil is a module providing an interface for retrieving information on running processes and system utilization (CPU, memory) in a portable way by using Python, implementing many functionalities offered by tools like ps, top and Windows task manager.
It currently supports Linux, Windows, OSX, Sun Solaris, FreeBSD, OpenBSD and NetBSD, both 32-bit and 64-bit architectures, with Python versions from 2.6 to 3.5 (users of Python 2.4 and 2.5 may use 2.1.3 version).
Some examples:
#!/usr/bin/env python
import psutil
# gives a single float value
psutil.cpu_percent()
# gives an object with many fields
psutil.virtual_memory()
# you can convert that object to a dictionary
dict(psutil.virtual_memory()._asdict())
# you can have the percentage of used RAM
psutil.virtual_memory().percent
79.2
# you can calculate percentage of available memory
psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
20.8
The psutil documentation covers more concepts and examples:
https://psutil.readthedocs.io/en/latest/
Use the psutil library. On Ubuntu 18.04, pip installed 5.5.0 (latest version) as of 1-30-2019. Older versions may behave somewhat differently.
You can check your version of psutil by doing this in Python:
from __future__ import print_function # for Python2
import psutil
print(psutil.__version__)
To get some memory and CPU stats:
from __future__ import print_function
import psutil
print(psutil.cpu_percent())
print(psutil.virtual_memory()) # physical memory usage
print('memory % used:', psutil.virtual_memory()[2])
The virtual_memory (tuple) will have the percent memory used system-wide. This seemed to be overestimated by a few percent for me on Ubuntu 18.04.
You can also get the memory used by the current Python instance:
import os
import psutil
pid = os.getpid()
python_process = psutil.Process(pid)
memoryUse = python_process.memory_info()[0]/2.**30 # resident set size (RSS) in bytes, converted to GiB
print('memory use:', memoryUse)
which gives the current memory use of your Python script.
There are some more in-depth examples on the pypi page for psutil.
Only for Linux:
One-liner for the RAM usage with only stdlib dependency:
import os
tot_m, used_m, free_m = map(int, os.popen('free -t -m').readlines()[-1].split()[1:])
One can get real time CPU and RAM monitoring by combining tqdm and psutil. It may be handy when running heavy computations / processing.
It also works in Jupyter without any code changes:
from tqdm import tqdm
from time import sleep
import psutil
with tqdm(total=100, desc='cpu%', position=1) as cpubar, tqdm(total=100, desc='ram%', position=0) as rambar:
    while True:
        rambar.n=psutil.virtual_memory().percent
        cpubar.n=psutil.cpu_percent()
        rambar.refresh()
        cpubar.refresh()
        sleep(0.5)
It's convenient to put those progress bars in separate process using multiprocessing library.
This code snippet is also available as a gist.
The code below worked for me without external libraries. I tested it on Python 2.7.9.
CPU Usage
import os
CPU_Pct=str(round(float(os.popen('''grep 'cpu ' /proc/stat | awk '{usage=($2+$4)*100/($2+$4+$5)} END {print usage }' ''').readline()),2))
print("CPU Usage = " + CPU_Pct) # print results
And Ram Usage, Total, Used and Free
import os
mem=str(os.popen('free -t -m').readlines())
"""
Get a whole line of memory output, it will be something like below
[' total used free shared buffers cached\n',
'Mem: 925 591 334 14 30 355\n',
'-/+ buffers/cache: 205 719\n',
'Swap: 99 0 99\n',
'Total: 1025 591 434\n']
So, we need total memory, usage and free memory.
We should find the index of capital T which is unique at this string
"""
T_ind=mem.index('T')
"""
Then, we can recreate the string with this information. After T we have,
"Total: " which has 14 characters, so we can start from index of T +14
and last 4 characters are also not necessary.
We can create a new sub-string using this information
"""
mem_G=mem[T_ind+14:-4]
"""
The result will be like
1025 603 422
we need to find the index of the first space, and we can start our substring
from 0 to this index number, this will give us the string of total memory
"""
S1_ind=mem_G.index(' ')
mem_T=mem_G[0:S1_ind]
"""
Similarly we will create a new sub-string, which will start at the second value.
The resulting string will be like
603 422
Again, we should find the index of the first space and then
take the Used Memory and Free Memory.
"""
mem_G1=mem_G[S1_ind+8:]
S2_ind=mem_G1.index(' ')
mem_U=mem_G1[0:S2_ind]
mem_F=mem_G1[S2_ind+8:]
print 'Summary = ' + mem_G
print 'Total Memory = ' + mem_T +' MB'
print 'Used Memory = ' + mem_U +' MB'
print 'Free Memory = ' + mem_F +' MB'
To get a line-by-line memory and time analysis of your program, I suggest using memory_profiler and line_profiler.
Installation:
# Time profiler
$ pip install line_profiler
# Memory profiler
$ pip install memory_profiler
# Install the dependency for a faster analysis
$ pip install psutil
The common part is, you specify which function you want to analyse by using the respective decorators.
Example: I have several functions in my Python file main.py that I want to analyse. One of them is linearRegressionfit(). I need to use the decorator @profile, which lets me profile the code with respect to both time and memory.
Make the following changes to the function definition
@profile
def linearRegressionfit(Xt,Yt,Xts,Yts):
    lr=LinearRegression()
    model=lr.fit(Xt,Yt)
    predict=lr.predict(Xts)
    # More Code
For Time Profiling,
Run:
$ kernprof -l -v main.py
Output
Total time: 0.181071 s
File: main.py
Function: linearRegressionfit at line 35
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    35                                           @profile
    36                                           def linearRegressionfit(Xt,Yt,Xts,Yts):
    37         1         52.0     52.0      0.1      lr=LinearRegression()
    38         1      28942.0  28942.0     75.2      model=lr.fit(Xt,Yt)
    39         1       1347.0   1347.0      3.5      predict=lr.predict(Xts)
    40
    41         1       4924.0   4924.0     12.8      print("train Accuracy",lr.score(Xt,Yt))
    42         1       3242.0   3242.0      8.4      print("test Accuracy",lr.score(Xts,Yts))
For Memory Profiling,
Run:
$ python -m memory_profiler main.py
Output
Filename: main.py
Line #    Mem usage    Increment   Line Contents
================================================
    35  125.992 MiB  125.992 MiB   @profile
    36                             def linearRegressionfit(Xt,Yt,Xts,Yts):
    37  125.992 MiB    0.000 MiB       lr=LinearRegression()
    38  130.547 MiB    4.555 MiB       model=lr.fit(Xt,Yt)
    39  130.547 MiB    0.000 MiB       predict=lr.predict(Xts)
    40
    41  130.547 MiB    0.000 MiB       print("train Accuracy",lr.score(Xt,Yt))
    42  130.547 MiB    0.000 MiB       print("test Accuracy",lr.score(Xts,Yts))
The memory profiler results can also be plotted with matplotlib using
$ mprof run main.py
$ mprof plot
Note: Tested on
line_profiler version == 3.0.2
memory_profiler version == 0.57.0
psutil version == 5.7.0
EDIT: The results from the profilers can be parsed using the TAMPPA package. Using it, we can get the desired line-by-line plots.
We chose to read the usual information source for this, /proc/meminfo, because we wanted to see instantaneous fluctuations in free memory and found querying meminfo helpful. It also gave us a few more related parameters that were pre-parsed.
Code
import os
linux_filepath = "/proc/meminfo"
meminfo = dict(
    (i.split()[0].rstrip(":"), int(i.split()[1]))
    for i in open(linux_filepath).readlines()
)
meminfo["memory_total_gb"] = meminfo["MemTotal"] / (2 ** 20)
meminfo["memory_free_gb"] = meminfo["MemFree"] / (2 ** 20)
meminfo["memory_available_gb"] = meminfo["MemAvailable"] / (2 ** 20)
Output for reference (we stripped all newlines for further analysis)
MemTotal: 1014500 kB MemFree: 562680 kB MemAvailable: 646364 kB
Buffers: 15144 kB Cached: 210720 kB SwapCached: 0 kB Active: 261476 kB
Inactive: 128888 kB Active(anon): 167092 kB Inactive(anon): 20888 kB
Active(file): 94384 kB Inactive(file): 108000 kB Unevictable: 3652 kB
Mlocked: 3652 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 0 kB Writeback:
0 kB AnonPages: 168160 kB Mapped: 81352 kB Shmem: 21060 kB Slab: 34492
kB SReclaimable: 18044 kB SUnreclaim: 16448 kB KernelStack: 2672 kB
PageTables: 8180 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB
CommitLimit: 507248 kB Committed_AS: 1038756 kB VmallocTotal:
34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted:
0 kB AnonHugePages: 88064 kB CmaTotal: 0 kB CmaFree: 0 kB
HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp:
0 Hugepagesize: 2048 kB DirectMap4k: 43008 kB DirectMap2M: 1005568 kB
Here's something I put together a while ago, it's windows only but may help you get part of what you need done.
Derived from:
"for sys available mem"
http://msdn2.microsoft.com/en-us/library/aa455130.aspx
"individual process information and python script examples"
http://www.microsoft.com/technet/scriptcenter/scripts/default.mspx?mfr=true
NOTE: the WMI interface/process is also available for performing similar tasks
I'm not using it here because the current method covers my needs, but if someday it's needed to extend or improve this, then you may want to investigate the WMI tools available.
WMI for python:
http://tgolden.sc.sabren.com/python/wmi.html
The code:
'''
Monitor window processes
derived from:
>for sys available mem
http://msdn2.microsoft.com/en-us/library/aa455130.aspx
> individual process information and python script examples
http://www.microsoft.com/technet/scriptcenter/scripts/default.mspx?mfr=true
NOTE: the WMI interface/process is also available for performing similar tasks
I'm not using it here because the current method covers my needs, but if someday it's needed
to extend or improve this module, then may want to investigate the WMI tools available.
WMI for python:
http://tgolden.sc.sabren.com/python/wmi.html
'''
__revision__ = 3
import win32com.client
from ctypes import *
from ctypes.wintypes import *
import pythoncom
import pywintypes
import datetime
class MEMORYSTATUS(Structure):
_fields_ = [
('dwLength', DWORD),
('dwMemoryLoad', DWORD),
('dwTotalPhys', DWORD),
('dwAvailPhys', DWORD),
('dwTotalPageFile', DWORD),
('dwAvailPageFile', DWORD),
('dwTotalVirtual', DWORD),
('dwAvailVirtual', DWORD),
]
def winmem():
x = MEMORYSTATUS() # create the structure
windll.kernel32.GlobalMemoryStatus(byref(x)) # from cytypes.wintypes
return x
class process_stats:
'''process_stats is able to provide counters of (all?) the items available in perfmon.
Refer to the self.supported_types keys for the currently supported 'Performance Objects'
To add logging support for other data you can derive the necessary data from perfmon:
---------
perfmon can be run from windows 'run' menu by entering 'perfmon' and enter.
Clicking on the '+' will open the 'add counters' menu,
From the 'Add Counters' dialog, the 'Performance object' is the self.support_types key.
--> Where spaces are removed and symbols are entered as text (Ex. # == Number, % == Percent)
For the items you wish to log add the proper attribute name in the list in the self.supported_types dictionary,
keyed by the 'Performance Object' name as mentioned above.
---------
NOTE: The 'NETFramework_NETCLRMemory' key does not seem to log dotnet 2.0 properly.
Initially the python implementation was derived from:
http://www.microsoft.com/technet/scriptcenter/scripts/default.mspx?mfr=true
'''
def __init__(self,process_name_list=[],perf_object_list=[],filter_list=[]):
'''process_names_list == the list of all processes to log (if empty log all)
perf_object_list == list of process counters to log
filter_list == list of text to filter
print_results == boolean, output to stdout
'''
pythoncom.CoInitialize() # Needed when run by the same process in a thread
self.process_name_list = process_name_list
self.perf_object_list = perf_object_list
self.filter_list = filter_list
self.win32_perf_base = 'Win32_PerfFormattedData_'
# Define new datatypes here!
self.supported_types = {
'NETFramework_NETCLRMemory': [
'Name',
'NumberTotalCommittedBytes',
'NumberTotalReservedBytes',
'NumberInducedGC',
'NumberGen0Collections',
'NumberGen1Collections',
'NumberGen2Collections',
'PromotedMemoryFromGen0',
'PromotedMemoryFromGen1',
'PercentTimeInGC',
'LargeObjectHeapSize'
],
'PerfProc_Process': [
'Name',
'PrivateBytes',
'ElapsedTime',
'IDProcess',# pid
'Caption',
'CreatingProcessID',
'Description',
'IODataBytesPersec',
'IODataOperationsPersec',
'IOOtherBytesPersec',
'IOOtherOperationsPersec',
'IOReadBytesPersec',
'IOReadOperationsPersec',
'IOWriteBytesPersec',
'IOWriteOperationsPersec'
]
}
def get_pid_stats(self, pid):
this_proc_dict = {}
pythoncom.CoInitialize() # Needed when run by the same process in a thread
if not self.perf_object_list:
perf_object_list = self.supported_types.keys()
for counter_type in perf_object_list:
strComputer = "."
objWMIService = win32com.client.Dispatch("WbemScripting.SWbemLocator")
objSWbemServices = objWMIService.ConnectServer(strComputer,"root\cimv2")
query_str = '''Select * from %s%s''' % (self.win32_perf_base,counter_type)
colItems = objSWbemServices.ExecQuery(query_str) # "Select * from Win32_PerfFormattedData_PerfProc_Process")# changed from Win32_Thread
if len(colItems) > 0:
for objItem in colItems:
if hasattr(objItem, 'IDProcess') and pid == objItem.IDProcess:
for attribute in self.supported_types[counter_type]:
eval_str = 'objItem.%s' % (attribute)
this_proc_dict[attribute] = eval(eval_str)
this_proc_dict['TimeStamp'] = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.') + str(datetime.datetime.now().microsecond)[:3]
break
return this_proc_dict
def get_stats(self):
'''
Show process stats for all processes in given list, if none given return all processes
If filter list is defined return only the items that match or contained in the list
Returns a list of result dictionaries
'''
pythoncom.CoInitialize() # Needed when run by the same process in a thread
proc_results_list = []
if not self.perf_object_list:
perf_object_list = self.supported_types.keys()
for counter_type in perf_object_list:
strComputer = "."
objWMIService = win32com.client.Dispatch("WbemScripting.SWbemLocator")
objSWbemServices = objWMIService.ConnectServer(strComputer,"root\cimv2")
query_str = '''Select * from %s%s''' % (self.win32_perf_base,counter_type)
colItems = objSWbemServices.ExecQuery(query_str) # "Select * from Win32_PerfFormattedData_PerfProc_Process")# changed from Win32_Thread
try:
if len(colItems) > 0:
for objItem in colItems:
found_flag = False
this_proc_dict = {}
if not self.process_name_list:
found_flag = True
else:
# Check if process name is in the process name list, allow print if it is
for proc_name in self.process_name_list:
obj_name = objItem.Name
if proc_name.lower() in obj_name.lower(): # will log if contains name
found_flag = True
break
if found_flag:
for attribute in self.supported_types[counter_type]:
eval_str = 'objItem.%s' % (attribute)
this_proc_dict[attribute] = eval(eval_str)
this_proc_dict['TimeStamp'] = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.') + str(datetime.datetime.now().microsecond)[:3]
proc_results_list.append(this_proc_dict)
except pywintypes.com_error, err_msg:
# Ignore and continue (proc_mem_logger calls this function once per second)
continue
return proc_results_list
def get_sys_stats():
''' Returns a dictionary of the system stats'''
pythoncom.CoInitialize() # Needed when run by the same process in a thread
x = winmem()
sys_dict = {
'dwAvailPhys': x.dwAvailPhys,
'dwAvailVirtual':x.dwAvailVirtual
}
return sys_dict
if __name__ == '__main__':
# This area used for testing only
sys_dict = get_sys_stats()
stats_processor = process_stats(process_name_list=['process2watch'],perf_object_list=[],filter_list=[])
proc_results = stats_processor.get_stats()
for result_dict in proc_results:
print result_dict
import os
this_pid = os.getpid()
this_proc_results = stats_processor.get_pid_stats(this_pid)
print 'this proc results:'
print this_proc_results
I feel like these answers were written for Python 2, and in any case nobody's made mention of the standard resource package that's available for Python 3. It provides commands for obtaining the resource limits of a given process (the calling Python process by default). This isn't the same as getting the current usage of resources by the system as a whole, but it could solve some of the same problems like e.g. "I want to make sure I only use X much RAM with this script."
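A minimal sketch of what the resource module reports, assuming a Unix-like system (the module is not available on Windows). The function name is made up for illustration. Note that ru_maxrss is in kilobytes on Linux but in bytes on macOS, so the raw value is returned without conversion.

```python
import resource


def limits_and_usage():
    """Return the address-space limit and usage counters for this process."""
    # Soft and hard limits on total address space; resource.RLIM_INFINITY means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Counters accumulated by the calling process so far.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "as_soft_limit": soft,
        "as_hard_limit": hard,
        "max_rss": usage.ru_maxrss,      # peak resident set size, platform-dependent unit
        "user_cpu_seconds": usage.ru_utime,
        "system_cpu_seconds": usage.ru_stime,
    }
```

To cap the memory of a script, resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard)) can lower the soft limit; allocations beyond it then fail with MemoryError.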
This aggregates all the goodies:
psutil + os to get Unix & Windows compatibility:
That allows us to get:
CPU
memory
disk
code:
import os
import psutil # need: pip install psutil
In [32]: psutil.virtual_memory()
Out[32]: svmem(total=6247907328, available=2502328320, percent=59.9, used=3327135744, free=167067648, active=3671199744, inactive=1662668800, buffers=844783616, cached=1908920320, shared=123912192, slab=613048320)
In [33]: psutil.virtual_memory().percent
Out[33]: 60.0
In [34]: psutil.cpu_percent()
Out[34]: 5.5
In [35]: os.sep
Out[35]: '/'
In [36]: psutil.disk_usage(os.sep)
Out[36]: sdiskusage(total=50190790656, used=41343860736, free=6467502080, percent=86.5)
In [37]: psutil.disk_usage(os.sep).percent
Out[37]: 86.5
I took the feedback from the first response and made small changes:
#!/usr/bin/env python
#Execute this command on a Windows machine to install psutil: python -m pip install psutil
import psutil
print (' ')
print ('----------------------CPU Information summary----------------------')
print (' ')
# gives a single float value
vcc=psutil.cpu_count()
print ('Total number of CPUs :',vcc)
vcpu=psutil.cpu_percent()
print ('Total CPUs utilized percentage :',vcpu,'%')
print (' ')
print ('----------------------RAM Information summary----------------------')
print (' ')
# you can convert that object to a dictionary
#print(dict(psutil.virtual_memory()._asdict()))
# gives an object with many fields
vvm=psutil.virtual_memory()
x=dict(psutil.virtual_memory()._asdict())
def forloop():
    for i in x:
        print (i,"--",x[i]/1024/1024/1024)#Output will be printed in GBs
forloop()
print (' ')
print ('----------------------RAM Utilization summary----------------------')
print (' ')
# you can have the percentage of used RAM
print('Percentage of used RAM :',psutil.virtual_memory().percent,'%')
#79.2
# you can calculate percentage of available memory
print('Percentage of available RAM :',psutil.virtual_memory().available * 100 / psutil.virtual_memory().total,'%')
#20.8
"... current system status (current CPU, RAM, free disk space, etc.)" And "*nix and Windows platforms" can be a difficult combination to achieve.
The operating systems are fundamentally different in the way they manage these resources. Indeed, they differ in core concepts like defining what counts as system and what counts as application time.
"Free disk space"? What counts as "disk space?" All partitions of all devices? What about foreign partitions in a multi-boot environment?
I don't think there's a clear enough consensus between Windows and *nix that makes this possible. Indeed, there may not even be any consensus between the various operating systems called Windows. Is there a single Windows API that works for both XP and Vista?
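For the "what counts as disk space" part, the Python 3 standard library sidesteps the question by reporting on the filesystem that contains a given path, on both Unix and Windows. A small sketch under that assumption; disk_summary is an illustrative name, not an existing API:

```python
import os
import shutil


def disk_summary(path=os.sep):
    """Summarize the filesystem that contains `path`, plus the logical CPU count."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gib": usage.total / 2**30,
        "free_gib": usage.free / 2**30,
        "percent_used": percent_used,
        "logical_cpus": os.cpu_count(),  # may be None if it cannot be determined
    }
```

Each partition or mount point has to be queried separately, which matches the point above that there is no single agreed definition of "the disk".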
This script for CPU usage:
import os
def get_cpu_load():
    """ Returns a list of CPU loads"""
    result = []
    cmd = "WMIC CPU GET LoadPercentage "
    response = os.popen(cmd + ' 2>&1','r').read().strip().split("\r\n")
    for load in response[1:]:
        result.append(int(load))
    return result
if __name__ == '__main__':
    print(get_cpu_load())
For CPU details use psutil library
https://psutil.readthedocs.io/en/latest/#cpu
For RAM frequency (in MHz), use the Linux utility dmidecode and manipulate the output a bit ;). This command needs root permission, hence you supply your password too. Just copy the following command, replacing mypass with your password:
import os
os.system("echo mypass | sudo -S dmidecode -t memory | grep 'Clock Speed' | cut -d ':' -f2")
------------------- Output ---------------------------
1600 MT/s
Unknown
1600 MT/s
Unknown 0
More specifically:
[i for i in os.popen("echo mypass | sudo -S dmidecode -t memory | grep 'Clock Speed' | cut -d ':' -f2").read().split(' ') if i.isdigit()]
-------------------------- output -------------------------
['1600', '1600']
You can read /proc/meminfo to get used memory:
file1 = open('/proc/meminfo', 'r')
for line in file1:
    if 'MemTotal' in line:
        x = line.split()
        memTotal = int(x[1])
    if 'Buffers' in line:
        x = line.split()
        buffers = int(x[1])
    if 'Cached' in line and 'SwapCached' not in line:
        x = line.split()
        cached = int(x[1])
    if 'MemFree' in line:
        x = line.split()
        memFree = int(x[1])
file1.close()
percentage_used = int ( ( memTotal - (buffers + cached + memFree) ) / memTotal * 100 )
print(percentage_used)
Based on the CPU usage code by @Hrabal, this is what I use:
from subprocess import Popen, PIPE
def get_cpu_usage():
    ''' Get CPU usage on Linux by reading /proc/stat '''
    sub = Popen(('grep', 'cpu', '/proc/stat'), stdout=PIPE, stderr=PIPE)
    # The first output line is the aggregate "cpu" row; fields 1-4 are user, nice, system, idle
    top_vals = [int(val) for val in sub.communicate()[0].decode().split('\n')[0].split()[1:5]]
    return (top_vals[0] + top_vals[2]) * 100. /(top_vals[0] + top_vals[2] + top_vals[3])
You can use psutil or psmem with subprocess
example code
import subprocess
cmd = subprocess.Popen(['sudo','./ps_mem'],stdout=subprocess.PIPE,stderr=subprocess.PIPE)
out,error = cmd.communicate()
memory = out.splitlines()
Reference
https://github.com/Leo-g/python-flask-cmd
You can also use the recently released SystemScripter library, installed with pip install SystemScripter. It builds on other libraries, psutil among them, to provide a full set of system information that spans from CPU to disk information.
For current CPU usage use the function:
SystemScripter.CPU.CpuPerCurrentUtil(SystemScripter.CPU()) #class init as self param if not work
This gets the usage percentage or use:
SystemScripter.CPU.CpuCurrentUtil(SystemScripter.CPU())
https://pypi.org/project/SystemScripter/#description
Run with crontab won't print pid
Setup: */1 * * * * sh dog.sh this line in crontab -e
import os
import re
CUT_OFF = 90
def get_cpu_load():
    cmd = "ps -Ao user,uid,comm,pid,pcpu --sort=-pcpu | head -n 2 | tail -1"
    response = os.popen(cmd, 'r').read()
    arr = re.findall(r'\S+', response)
    print(arr)
    needKill = float(arr[-1]) > CUT_OFF
    if needKill:
        r = os.popen(f"kill -9 {arr[-2]}")
        print('kill:', r)
if __name__ == '__main__':
    # Test CPU with
    # $ stress --cpu 1
    # crontab -e
    # Every 1 min
    # */1 * * * * sh dog.sh
    # ctrl o, ctrl x
    # crontab -l
    print(get_cpu_load())
Shell-out not needed for @CodeGench's solution, so assuming Linux and Python's standard libraries:
def cpu_load():
    with open("/proc/stat", "r") as stat:
        (key, user, nice, system, idle, _) = (stat.readline().split(None, 5))
    assert key == "cpu", "'cpu ...' should be the first line in /proc/stat"
    busy = int(user) + int(nice) + int(system)
    return 100 * busy / (busy + int(idle))
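The counters in /proc/stat are cumulative since boot, so a single read gives the average load since boot rather than the current load. A sketch of the usual two-sample approach, still assuming Linux and only the standard library (the function names are illustrative):

```python
import time


def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    user, nice, system, idle = (int(value) for value in fields[1:5])
    busy = user + nice + system
    return busy, busy + idle


def cpu_percent_between(first_line, second_line):
    """Percentage of busy time between two samples of the 'cpu' line."""
    busy1, total1 = parse_cpu_line(first_line)
    busy2, total2 = parse_cpu_line(second_line)
    elapsed = total2 - total1
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy2 - busy1) / elapsed


def sample_cpu_percent(interval=0.5, path="/proc/stat"):
    """Read the 'cpu' line twice, `interval` seconds apart, and compare."""
    with open(path) as stat:
        first = stat.readline()
    time.sleep(interval)
    with open(path) as stat:
        second = stat.readline()
    return cpu_percent_between(first, second)
```

Like cpu_load() above, this counts only user, nice and system time as busy and ignores iowait, irq and steal.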
I don't believe that there is a well-supported multi-platform library available. Remember that Python itself is written in C so any library is simply going to make a smart decision about which OS-specific code snippet to run, as you suggested above.

Parallel execution of bash task from python, extracting csv columns

I have a csv file with 7,221,032 columns and 37 rows. I need to map each column to a separate file, ideally from a python script. My attempt so far:
import os
import numpy as np
num_features = 7221032
binary_dir = "data_binary"
command_template = command = 'awk -F "\\"*,\\"*" \'{print $%s}\' %s/images_binary.txt > %s/feature_files/pixel_%s.vector'
batch_size = 100
batch_indexes = np.arange(1, num_features, batch_size)
for batch_index in batch_indexes[1:5]:
    indexes = range(batch_index-batch_size, batch_index)
    commands = [command_template % (str(i), binary_dir, binary_dir, str(i)) for i in indexes]
    map(os.system, commands)
But this appears to be a rather slow process. Any advice on how to speed it up?
Revised solution - using Perl
Run with perl prog.pl < /path/to/images_binary.txt
Runtime is 10 sec for 100,000 items. It will take about 7 hours for the complete data set. I am not sure that running in parallel will do better, since the bottleneck is the open/close of files. Your best choice to improve performance will be to reduce the number of generated files, or somehow get the input written in column order first.
#! /usr/bin/perl
while ( my $x = <> ) {
    chomp $x ;
    my @v = split(',', $x) ;
    foreach my $i (0..$#v) {
        # Append, so values from later rows do not overwrite earlier ones
        open OF, ">>data_binary/feature_files/pixel_$i.vector" ;
        print OF $v[$i], "\n" ;
        close OF ;
    } ;
} ;

Passing arguments vs. using pipes to communicate with child processes

In the following code, I time how long it takes to pass a large array (8 MB) to a child process using the args keyword when forking the process versus passing it using a pipe.
Does anyone have any insight into why it is so much faster to pass data using an argument than using a pipe?
Below, each code block is a cell in a Jupyter notebook.
import multiprocessing as mp
import random
N = 2**20
x = list(map(lambda x : random.random(),range(N)))
Time the call to sum in the parent process (for comparison only):
%%timeit -n 5 -r 10 -p 8 -o -q
pass
y = sum(x)/N
t_sum = _
Time the result of calling sum from a child process, using the args keyword to pass list x to child process.
def mean(x,q):
    q.put(sum(x))
%%timeit -n 5 -r 10 -p 8 -o -q
pass
q = mp.Queue()
p = mp.Process(target=mean,args=(x,q))
p.start()
p.join()
s = q.get()
m = s/N
t_mean = _
Time using a pipe to pass data to child process
def mean_pipe(cp,q):
    x = cp.recv()
    q.put(sum(x))
%%timeit -n 5 -r 10 -p 8 -o -q
pass
q = mp.Queue()
pipe0,pipe1 = mp.Pipe()
p = mp.Process(target=mean_pipe,args=[pipe0,q])
p.start()
pipe1.send(x)
p.join()
s = q.get()
m = s/N
t_mean_pipe = _
(ADDED in response to comment) Use mp.Array shared memory feature (very slow!)
def mean_pipe_shared(xs,q):
    q.put(sum(xs))
%%timeit -n 5 -r 10 -p 8 -o -q
xs = mp.Array('d',x)
q = mp.Queue()
p = mp.Process(target=mean_pipe_shared,args=[xs,q])
p.start()
p.join()
s = q.get()
m = s/N
t_mean_shared = _
Print out results (ms)
print("{:>20s} {:12.4f}".format("MB",8*N/1024**2))
print("{:>20s} {:12.4f}".format("mean (main)",1000*t_sum.best))
print("{:>20s} {:12.4f}".format("mean (args)",1000*t_mean.best))
print("{:>20s} {:12.4f}".format("mean (pipe)",1000*t_mean_pipe.best))
print("{:>20s} {:12.4f}".format("mean (shared)",1000*t_mean_shared.best))
MB 8.0000
mean (main) 7.1931
mean (args) 38.5217
mean (pipe) 136.5020
mean (shared) 4195.0568
Using the pipe is over 3 times slower than passing arguments to the child process. And unless I am doing something very wrong, mp.Array is a non-starter.
Why is the pipe so much slower than passing directly to the subprocess (using args)? And what's up with the shared memory?
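One factor that may account for part of the gap, assuming the fork start method is in use: with fork, objects passed through args are inherited by the child without being serialized, whereas a pipe's send() pickles the object, writes the bytes through the pipe, and recv() unpickles them on the other side. A rough sketch for timing just that pickling round trip, so it can be compared with the figures above (the helper name is made up for illustration):

```python
import pickle
import time


def pickle_round_trip_seconds(data):
    """Time pickling `data` and unpickling it again.

    Returns (elapsed_seconds, payload_size_in_bytes, restored_object).
    """
    start = time.perf_counter()
    payload = pickle.dumps(data)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return elapsed, len(payload), restored
```

Running it on the list x from the first cell gives the serialization cost alone, without any process start-up or pipe transfer.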

call python with system() in R to run a python script emulating the python console

I want to pass a chunk of Python code to Python in R with something like system('python ...'), and I'm wondering if there is an easy way to emulate the python console in this case. For example, suppose the code is "print 'hello world'", how can I get the output like this in R?
>>> print 'hello world'
hello world
This only shows the output:
> system("python -c 'print \"hello world\"'")
hello world
Thanks!
BTW, I asked in r-help but have not got a response yet (if I do, I'll post the answer here).
Do you mean something like this?
export NUM=10
R -q -e "rnorm($NUM)"
You might also like to check out littler - http://dirk.eddelbuettel.com/code/littler.html
UPDATED
Following your comment below, I think I am beginning to understand your question better. You are asking about running python inside the R shell.
So here's an example:-
# code in a file named myfirstpythonfile.py
a = 1
b = 19
c = 3
mylist = [a, b, c]
for item in mylist:
    print item
In your R shell, therefore, do this:
> system('python myfirstpythonfile.py')
1
19
3
Essentially, you can simply call python /path/to/your/python/file.py to execute a block of python code.
In my case, I can simply call python myfirstpythonfile.py assuming that I launched my R shell in the same directory (path) my python file resides.
FURTHER UPDATED
And if you really want to print out the source code, here's a brute force method that might be possible. In your R shell:-
> system('python -c "import sys; sys.stdout.write(file(\'myfirstpythonfile.py\', \'r\').read());"; python myfirstpythonfile.py')
a = 1
b = 19
c = 3
mylist = [a, b, c]
for item in mylist:
    print item
1
19
3
AND FURTHER FURTHER UPDATED :-)
So if the purpose is to print each line of Python code before it is executed, we can use the Python trace module (reference: http://docs.python.org/library/trace.html). On the command line, the -m option runs a Python module as a script, and the options for that module follow it.
So for my example above, it would be:-
$ python -m trace --trace myfirstpythonfile.py
--- modulename: myfirstpythonfile, funcname: <module>
myfirstpythonfile.py(1): a = 1
myfirstpythonfile.py(2): b = 19
myfirstpythonfile.py(3): c = 3
myfirstpythonfile.py(4): mylist = [a, b, c]
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
1
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
19
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
3
myfirstpythonfile.py(5): for item in mylist:
--- modulename: trace, funcname: _unsettrace
trace.py(80): sys.settrace(None)
As we can see, trace prints each line of Python code as it is executed, and any output from that line follows immediately on stdout.
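If the goal is the `>>>` transcript from the question rather than trace's file/line prefix, the standard library's `code.InteractiveConsole` can replay a script as if it had been typed at the prompt. This is only a sketch in Python 3 syntax (the thread itself uses Python 2). `console_transcript` is a made-up helper name, and the sketch assumes an indented block is followed by a blank line or by the end of the input, as it would have to be at a real prompt.

```python
import code
import contextlib
import io


def console_transcript(source):
    """Replay `source` line by line and return a console-style transcript.

    Hypothetical helper for illustration; not part of any library.
    """
    console = code.InteractiveConsole()
    out = io.StringIO()
    # Capture the echoed prompts, the program's output and any tracebacks.
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(out):
        more = False
        for line in source.splitlines():
            # "... " while a compound statement is still open, ">>> " otherwise
            print(("... " if more else ">>> ") + line)
            more = console.push(line)
        if more:
            # Close a trailing indented block, as pressing Enter would.
            print("... ")
            console.push("")
    return out.getvalue()


if __name__ == "__main__":
    script = "mylist = [1, 19, 3]\nfor item in mylist:\n    print(item)\n"
    print(console_transcript(script), end="")
```

Saved to a file, this could be run from R with system() like any other script, assuming a Python 3 interpreter is on the PATH.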
The system command has an argument called intern, which defaults to FALSE. Set it to TRUE and whatever output was previously just printed will instead be returned, so you can store it in a variable.
Run your system command with this option and you get the output directly in your variable, like this:
tmp <- system("python -c 'print \"hello world\"'", intern = TRUE)
My workaround for this problem is to define my own functions that paste in parameters, write out a temporary .py file, and then execute the Python file via a system call. Here is an example that calls ArcGIS's Euclidean Distance function:
py.EucDistance = function(poly_path,poly_name,snap_raster,out_raster_path_name,maximum_distance,mask){
py_path = 'G:/Faculty/Mann/EucDistance_temp.py'
poly_path_name = paste(poly_path,poly_name, sep='')
fileConn<-file(paste(py_path))
writeLines(c(
paste('import arcpy'),
paste('from arcpy import env'),
paste('from arcpy.sa import *'),
paste('arcpy.CheckOutExtension("spatial")'),
paste('out_raster_path_name = "',out_raster_path_name,'"',sep=""),
paste('snap_raster = "',snap_raster,'"',sep=""),
paste('cellsize =arcpy.GetRasterProperties_management(snap_raster,"CELLSIZEX")'),
paste('mask = "',mask,'"',sep=""),
paste('maximum_distance = "',maximum_distance,'"',sep=""),
paste('sr = arcpy.Describe(snap_raster).spatialReference'),
paste('arcpy.env.overwriteOutput = True'),
paste('arcpy.env.snapRaster = "',snap_raster,'"',sep=""),
paste('arcpy.env.mask = mask'),
paste('arcpy.env.scratchWorkspace ="G:/Faculty/Mann/Historic_BCM/Aggregated1080/Scratch.gdb"'),
paste('arcpy.env.outputCoordinateSystem = sr'),
# get spatial reference for raster and force output to that
paste('sr = arcpy.Describe(snap_raster).spatialReference'),
paste('py_projection = sr.exportToString()'),
paste('arcpy.env.extent = snap_raster'),
paste('poly_name = "',poly_name,'"',sep=""),
paste('poly_path_name = "',poly_path_name,'"',sep=""),
paste('holder = EucDistance(poly_path_name, maximum_distance, cellsize, "")'),
paste('holder = SetNull(holder < -9999, holder)'),
paste('holder.save(out_raster_path_name) ')
), fileConn, sep = "\n")
close(fileConn)
system(paste('C:\\Python27\\ArcGIS10.1\\python.exe', py_path))
}

What is the best, python or bash for selectively concatenating lots of files?

I have around 20000 files coming from the output of some program, and their names follow the format:
data1.txt
data2.txt
...
data99.txt
data100.txt
...
data999.txt
data1000.txt
...
data20000.txt
I would like to write a script that gets as input argument the number N. Then it makes blocks of N concatenated files, so if N=5, it would make the following new files:
data_new_1.txt: it would contain (concatenated) data1.txt to data5.txt (like cat data1.txt data2.txt ...> data_new_1.txt )
data_new_2.txt: it would contain (concatenated) data6.txt to data10.txt
.....
I'd like to know what you think is the best approach for this: bash, Python, or something else like awk or Perl.
By best I mean the simplest code.
Thanks
Here's a Python (2.6) version (if you have Python 2.5, add a first line that says
from __future__ import with_statement
and the script will also work)...:
import sys
def main(N):
    rN = range(N)
    for iout, iin in enumerate(xrange(1, 99999, N)):
        with open('data_new_%s.txt' % (iout+1), 'w') as out:
            for di in rN:
                try: fin = open('data%s.txt' % (iin + di), 'r')
                except IOError: return
                out.write(fin.read())
                fin.close()
if __name__ == '__main__':
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    else:
        N = 5
    main(N)
As you can see from the other answers and comments, opinions on performance differ. Some believe that Python's startup (and module imports) will make this slower than bash, but the import part at least is bogus: sys, the only module needed, is built in, requires no "loading", and so adds basically negligible overhead. I suspect the repeated fork/exec of cat may slow bash down; others think that I/O will dominate anyway, making the two solutions equivalent. You'll have to benchmark with your own files, on your own system, to settle the question.
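For anyone on Python 3, where xrange no longer exists, the same idea can be sketched as below. This is not the answer's code: `concat_blocks` and its parameters are names invented here, it streams each file with `shutil.copyfileobj` instead of reading it whole, and it assumes the inputs are numbered from 1 without gaps.

```python
import os
import shutil


def concat_blocks(directory, n, prefix="data", out_prefix="data_new_"):
    """Concatenate data1.txt, data2.txt, ... in `directory` in blocks of n files.

    Hypothetical helper. Returns the number of output files written and
    stops at the first missing input number.
    """
    i = 1      # number of the next input file
    block = 0  # number of output files written so far
    while os.path.isfile(os.path.join(directory, "%s%d.txt" % (prefix, i))):
        block += 1
        out_path = os.path.join(directory, "%s%d.txt" % (out_prefix, block))
        with open(out_path, "wb") as out:
            for _ in range(n):
                in_path = os.path.join(directory, "%s%d.txt" % (prefix, i))
                if not os.path.isfile(in_path):
                    break  # last block may hold fewer than n files
                with open(in_path, "rb") as src:
                    shutil.copyfileobj(src, out)  # copies in chunks
                i += 1
    return block
```

Unlike the version above, no empty output file is left behind when the number of inputs is an exact multiple of N, because a block is only opened once its first input is known to exist.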
how about a one liner ? :)
ls data[0-9]*txt|sort -nk1.5|awk 'BEGIN{rn=5;i=1}{while((getline _<$0)>0){print _ >"data_new_"i".txt"}close($0)}NR%rn==0{i++}'
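The `sort -nk1.5` step is what keeps data10.txt from landing before data2.txt: it sorts numerically starting at the fifth character, just past "data", whereas a plain glob or `ls` lists names in lexical order. The difference, as a small Python sketch (Python 3 syntax; `numeric_key` is a name made up for this illustration):

```python
import re


def numeric_key(name):
    """Sort key: the first run of digits in the file name, as an integer."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


names = ["data10.txt", "data2.txt", "data1.txt", "data100.txt"]

lexical = sorted(names)                   # plain string comparison
numeric = sorted(names, key=numeric_key)  # compare by the embedded number

print(lexical)  # ['data1.txt', 'data10.txt', 'data100.txt', 'data2.txt']
print(numeric)  # ['data1.txt', 'data2.txt', 'data10.txt', 'data100.txt']
```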
I like this one which saves on executing processes, only 1 cat per block
#! /bin/bash
N=5 # block size
S=1 # start
E=20000 # end
for n in $(seq $S $N $E)
do
CMD="cat "
i=$n
while [ $i -lt $((n + N)) ]
do
CMD+="data$((i++)).txt "
done
$CMD > data_new_$((n / N + 1)).txt
done
Best in what sense? Bash can do this quite well, but it may be harder for you to write a good bash script if you are more familiar with another scripting language. Do you want to optimize for something specific?
That said, here's a bash implementation:
declare blocksize=5
declare i=1
declare blockstart=1
declare blockend=$blocksize
declare -a fileset
while [ -f "data${i}.txt" ] ; do
    fileset=("${fileset[@]}" "data${i}.txt")
    i=$(($i + 1))
    if [ $i -gt $blockend ] ; then
        # one cat per block; the output is named after the block's first file
        cat "${fileset[@]}" > "data_new_${blockstart}.txt"
        fileset=() # clear
        blockstart=$(($blockstart + $blocksize))
        blockend=$(($blockend + $blocksize))
    fi
done
EDIT: I see you now say "Best" == "Simplest code", but what's simple depends on you. For me Perl is simpler than Python, for some Awk is simpler than bash. It depends on what you know best.
EDIT again: inspired by dtmilano, I've changed mine to use cat once per blocksize, so now cat will be called 'only' 4000 times.
Suppose you have a simple script that concatenates its arguments and keeps a counter for you, like the following:
#!/usr/bin/bash
COUNT=0
if [ -f counter ]; then
COUNT=`cat counter`
fi
COUNT=$[$COUNT+1]
echo $COUNT > counter
cat "$@" > "$COUNT.data"
Then a command line like this will do:
find -name "*" -type f -print0 | xargs -0 -n 5 path_to_the_script
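What `xargs -0 -n 5` contributes here is the grouping: it hands the script at most five file names per invocation, and the last invocation gets whatever is left over. That batching step on its own can be sketched in Python (Python 3 syntax; `batches` is a hypothetical helper name):

```python
from itertools import islice


def batches(items, n):
    """Yield successive lists of at most n items, like xargs -n does with arguments."""
    it = iter(items)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    names = ["data%d.txt" % i for i in range(1, 8)]
    for number, batch in enumerate(batches(names, 5), start=1):
        # Each batch corresponds to one invocation of the script.
        print(number, batch)
```

Note that find does not emit names in numeric order, so the blocks produced this way will not necessarily be data1..data5, data6..data10 and so on unless the list is sorted first.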
Since this can easily be done in any shell I would simply use that.
This should do it:
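#!/bin/sh
FILES=$1
FILENO=1
# note: the glob expands in lexical order (data1, data10, data100, ...)
for i in data[0-9]*.txt; do
    # append first, then count down, so every block gets exactly $1 files
    cat "$i" >> "data_new_${FILENO}.txt"
    FILES=`expr $FILES - 1`
    if [ $FILES -eq 0 ]; then
        FILENO=`expr $FILENO + 1`
        FILES=$1
    fi
done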
Python version:
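#!/usr/bin/env python
import os
import sys
if __name__ == '__main__':
    files_per_file = int(sys.argv[1])
    i = 0
    while True:
        i += 1
        source_file = 'data%d.txt' % i
        if os.path.isfile(source_file):
            # (i - 1) // n + 1 puts files 1..n in block 1, n+1..2n in block 2, ...
            dest_file = 'data_new_%d.txt' % (((i - 1) // files_per_file) + 1)
            # append mode, so each source file is added to the end of its block
            with open(source_file) as src, open(dest_file, 'a') as dest:
                dest.write(src.read())
        else:
            break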
Simple enough?
make_cat.py
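limit = 1000
n = 5
for i in range((limit + n - 1) // n):
    # block i (0-based) covers data(i*n+1).txt up to data(i*n+n).txt, capped at limit
    names = ["data{0}.txt".format(j) for j in range(i*n + 1, min(i*n + n, limit) + 1)]
    print("cat {0} >data_new_{1}.txt".format(" ".join(names), i + 1))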
Script
python make_cat.py | sh
