Run selected lines of a Python Script on CMD - python

I execute Python scripts from cmd and have trouble using other interfaces, so I will be running them on CMD only. The Python script has 113 lines of code. I want to run and test a selected subset of lines before executing the complete script, without creating new Python scripts, working from the parent script itself.
From the example below (28 lines):
To run the parent script we say in cmd:
C:\Users\X> python myMasterDummyScript.py
Can we run just lines 1 - 16?
Dummy Example:
import numpy as np
from six.moves import range
from six.moves import cPickle as pickle
pickle_file = "C:\\A.pickle"
with open(pickle_file, 'rb') as f:
    data = pickle.load(f, encoding='latin1')
train_dataset = data['train_dataset']
test_dataset = data['test_dataset']
valid_dataset = data['valid_dataset']
train_labels = data['train_labels']
test_labels = data['test_labels']
valid_labels = data['valid_labels']
a = 28
b = 1
def reformat(dataset, labels):
    dataset = dataset.reshape(-1, a, a, b).astype(np.float32)
    labels = (np.arange(10) == labels[:, None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)

Open the parent script in an IDE like PyCharm, select the lines you want to execute, then right-click -> Execute Selection in Console.

It would theoretically be possible with a bit of work; however, please note that this is not how scripts are meant to be used. Instead, you should consider grouping coherent routine sequences into named functions and calling them from the command line.
Among other issues, you would have to update every caller of your script whenever the line numbers shift, you would have to repeat any imports a subsection might need, and it is generally not a good idea. I am still going to address it after making a case for refactoring, though...
Refactoring the script and calling specific functions
Consider this answer to Python: Run function from the command line
Your python script:
import numpy as np
from six.moves import range
from six.moves import cPickle as pickle

def load_data():
    pickle_file = "C:\\A.pickle"
    with open(pickle_file, 'rb') as f:
        data = pickle.load(f, encoding='latin1')
    train_dataset = data['train_dataset']
    test_dataset = data['test_dataset']
    valid_dataset = data['valid_dataset']
    train_labels = data['train_labels']
    test_labels = data['test_labels']
    valid_labels = data['valid_labels']
    # return the loaded arrays so other functions can use them
    return (train_dataset, train_labels, test_dataset, test_labels,
            valid_dataset, valid_labels)

def main():
    (train_dataset, train_labels, test_dataset, test_labels,
     valid_dataset, valid_labels) = load_data()
    a = 28
    b = 1
    def reformat(dataset, labels):
        dataset = dataset.reshape(-1, a, a, b).astype(np.float32)
        labels = (np.arange(10) == labels[:, None]).astype(np.float32)
        return dataset, labels
    train_dataset, train_labels = reformat(train_dataset, train_labels)
    test_dataset, test_labels = reformat(test_dataset, test_labels)
    valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
Your cmd code would look like this:
REM
REM any logic to identify which function to call
REM
python -c "import myMasterDummyScript; myMasterDummyScript.load_data()"
It also enables you to pass arguments from cmd into the function call.
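For example, a sketch of forwarding a cmd argument into the call; with -c, any extra arguments land in sys.argv, and this assumes load_data were changed to accept a path parameter:
python -c "import sys, myMasterDummyScript; myMasterDummyScript.load_data(sys.argv[1])" C:\A.pickle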
Now if you're really adamant about running an arbitrary subset of lines from an overall python script...
How to run specific lines from a script in cmd
Use cmd to read those lines out of the original script and write them to a temporary script.
Look at a proposed answer for batch script - read line by line. Adapting it slightly, without much of the error management (which would significantly bloat this answer):
@echo off
setlocal enabledelayedexpansion
SET startline=%1
SET endline=%2
SET originalscript=%3
SET tempscript=tempscript.py
SET line=1
REM erase tempscript
echo. > %tempscript%
for /f "tokens=*" %%a in (%originalscript%) do (
    if !line! GEQ %startline% (
        if !line! LEQ %endline% (
            echo %%a >> %tempscript%
        )
    )
    set /a line+=1
)
python %tempscript%
pause
You'd call it like this:
C:\> runlines.cmd 1 16 myMasterDummyScript.py

You could use the command line debugger pdb. As an example, given the following script:
print('1')
print('2')
print('3')
print('4')
print('5')
print('6')
print('7')
print('8')
print('9')
print('10')
print('11')
print('12')
print('13')
print('14')
print('15')
Here's a debug session that runs only lines 5-9: it jumps to line 5, sets a breakpoint at line 10, prints a listing to show the current line and breakpoints, and then continues execution. Type help to see all the available commands.
C:\>py -m pdb test.py
> c:\test.py(1)<module>()
-> print('1')
(Pdb) jump 5
> c:\test.py(5)<module>()
-> print('5')
(Pdb) b 10
Breakpoint 1 at c:\test.py:10
(Pdb) longlist
1 print('1')
2 print('2')
3 print('3')
4 print('4')
5 -> print('5')
6 print('6')
7 print('7')
8 print('8')
9 print('9')
10 B print('10')
11 print('11')
12 print('12')
13 print('13')
14 print('14')
15 print('15')
(Pdb) cont
5
6
7
8
9
> c:\test.py(10)<module>()
-> print('10')
(Pdb)
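If you would rather not type the commands each time, the same session can be driven non-interactively with pdb's -c option (available since Python 3.2). This is only a sketch; like the session above, it stops at the breakpoint, where you can type q to quit:
py -m pdb -c "jump 5" -c "b 10" -c "cont" test.py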

Option 1 - You can use a debugger to inspect everything at any moment of the code execution (python -m pdb myscript.py debugs your code too).
Option 2 - You can create a main file and sub-files with your pieces of script, import them in the main script, and execute the main file or any separate file to test.
Option 3 - You can use command-line arguments (using argparse, for example); a minimal sketch follows below.
I have no more options at the moment.
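A minimal sketch of Option 3, wired to the load_data / main functions from the refactored script above (the positional argument and its choices are assumptions, not part of the original script):
# at the bottom of myMasterDummyScript.py
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Run selected parts of the script')
    parser.add_argument('step', choices=['load_data', 'main'],
                        help='which part of the script to run')
    args = parser.parse_args()
    if args.step == 'load_data':
        load_data()
    else:
        main()
You would then call it as, for example: python myMasterDummyScript.py load_data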

Related

How to distribute multiprocess CPU usage over multiple nodes?

I am trying to run a job on an HPC using multiprocessing. Each process has a peak memory usage of ~44GB. The job class I can use allows 1-16 nodes, each with 32 CPUs and 124GB of memory. Therefore, if I want to run the code as quickly as possible (and within the max walltime limit), I should be able to use 2 CPUs on each node, up to a maximum of 32 across all 16 nodes. However, when I specify mp.Pool(32) the job quickly exceeds the memory limit, I assume because more than two CPUs were used on a node.
My natural instinct was to specify a maximum of 2 CPUs in the PBS script I run my python script from, but this configuration is not permitted on the system. I would really appreciate any insight; I've been scratching my head on this one for most of today, and have faced and worked around similar problems in the past without addressing the fundamentals at play here.
Simplified versions of both scripts below:
#!/bin/sh
#PBS -l select=16:ncpus=32:mem=124gb
#PBS -l walltime=24:00:00
module load anaconda3/personal
source activate py_env
python directory/script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import multiprocessing as mp

def df_function(df, arr1, arr2):
    df['col3'] = some_algorithm(df, arr1, arr2)
    return df

def parallelize_dataframe(df, func, num_cores):
    df_split = np.array_split(df, num_cores)
    with mp.Pool(num_cores, maxtasksperchild = 10 ** 3) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = pd.read_csv(direc + file)
    a = np.load(direc + a_file)
    b = np.load(direc + b_file)
    # Globally defining function with keyword defaults
    global f
    def f(df):
        return df_function(df, arr1 = a, arr2 = b)
    num_cores = 32 # i.e. 2 per node if evenly distributed.
    # Running the function as a multiprocess:
    df = parallelize_dataframe(df, f, num_cores)
    # Saving:
    df.to_csv(direc + 'outfile.csv', index = False)

if __name__ == '__main__':
    main()
To run your job as-is, you could simply request ncpu=32 and then in your python script set num_cores = 2. Obviously this has you paying for 32 cores and then leaving 30 of them idle, which is wasteful.
The real problem here is that your current algorithm is memory-bound, not CPU-bound. You should be going to great lengths to read only chunks of your files into memory, operating on the chunks, and then writing the result chunks to disk to be organized later.
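Before reaching for a framework, that chunked pattern can be sketched with plain pandas; the chunk size here is an arbitrary assumption, and some_algorithm is the undefined placeholder from the question:
import numpy as np
import pandas as pd

direc = '/home/dir1/dir2/'
a = np.load(direc + 'array_a.npy')
b = np.load(direc + 'array_b.npy')

# read the CSV in pieces instead of all at once
reader = pd.read_csv(direc + 'input_data.csv', chunksize=100_000)
with open(direc + 'outfile.csv', 'w') as out:
    for i, chunk in enumerate(reader):
        chunk['col3'] = some_algorithm(chunk, a, b)  # placeholder from the question
        # write the header only for the first chunk
        chunk.to_csv(out, header=(i == 0), index=False)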
Fortunately Dask is built to do exactly this kind of thing. As a first step, you can take out the parallelize_dataframe function and directly load and map your some_algorithm with a dask.dataframe and dask.array:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dask.dataframe as dd
import dask.array as da

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = dd.read_csv(direc + file, blocksize=25e6)
    a_and_b = da.from_npy_stack(direc)
    df['col3'] = df.apply(some_algorithm, args=(a_and_b,))
    # dask is lazy, this is the only line that does any work
    # Saving:
    df.to_csv(
        direc + 'outfile.csv',
        index = False,
        compute_kwargs={"scheduler": "threads"},  # also "processes", but try threads first
    )

if __name__ == '__main__':
    main()
That will require some tweaks to some_algorithm, and to_csv and from_npy_stack work a bit differently, but you will be able to reasonably run this thing just on your own laptop and it will scale to your cluster hardware. You can level up from here by using the distributed scheduler or even deploy it directly to your cluster with dask-jobqueue; a sketch of that last step follows below.
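A minimal dask-jobqueue sketch; the resource numbers simply mirror the PBS request in the question and are assumptions, not tested settings:
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# each PBS job gets 2 cores and enough memory for two ~44GB processes
cluster = PBSCluster(cores=2, memory='124GB', walltime='24:00:00')
cluster.scale(jobs=16)    # request up to 16 jobs (one per node)
client = Client(cluster)  # dask work submitted after this point runs on the cluster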

Running a pytest test from a python file and NOT from a command line

I have three python files in one directory (c:\Tests). I am trying to run the tests using pytest from the file TestCases1.py, but I have not succeeded. I am new to python and I do not know if I am asking the right question. I have seen several examples, but almost all use the command line to run the tests, and I want to run them from a python file. Since I am a newbie to testing, I would appreciate a very simple answer (I have seen some similar questions but I did not get the answers). I am using Python 3.6 (32-bit) and Eclipse Oxygen 3a.
min_max.py => Some basic functions to be tested
def min(values):
    _min = values[0]
    for val in values:
        if val < _min:
            _min = val
    return _min

def max(values):
    _max = values[0]
    for val in values:
        if val > _max:
            _max = val
    return _max
min_max_test.py => Some tests for the functions
import min_max

def test_min():
    print("starting")
    values = (2, 3, 1, 4, 6)
    val = min(values)
    assert val == 1
    print("done test_min")

def test_max():
    print("starting")
    values = (2, 3, 1, 4, 6)
    val = max(values)
    assert val == 6
    print("done test_max")
TestCases1.py => File from where I want to run the test
import pytest

pytest_args = [
    'c:\Tests\min_max_test.py'
]
pytest.main(pytest_args)
Optionally, you could use subprocess to run pytest commands from within your python script. For example,
# ~/tests
import subprocess
subprocess.run(["pytest . -q"], shell=True)
>>>
. [100%]
1 passed in 0.00s
CompletedProcess(args=['pytest . -q'], returncode=0)
In min_max_test.py, the min and max names used in the test functions are taken from the built-ins and not from your min_max.py file.
You either need to use something like min_max.min, or import those functions using a from import rather than a full module import, for example:
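A sketch of the from-import variant:
# min_max_test.py
from min_max import min, max

def test_min():
    values = (2, 3, 1, 4, 6)
    assert min(values) == 1
With the plain import min_max, you would instead call min_max.min(values) and min_max.max(values) inside the tests.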
P.S. Please include the error messages along with the question and be specific about the problem you are having; it makes it that much easier :)

scriptExit 1 with pybedtools venn_mpl - snakemake 5.2.4

I want to create Venn diagrams with pybedtools. There is a special script using matplotlib called venn_mpl. It works perfectly when I try it out in my jupyter notebook. You can do it with python or using shell commands.
Unfortunately something goes wrong when I want to use it in my snakefile, and I can't really figure out what the problem is.
First of all, this is the script: venn_mpl.py
#!/gnu/store/3w3nz0h93h7jif9d9c3hdfyimgkpx1a4-python-wrapper-3.7.0/bin/python
"""
Given 3 files, creates a 3-way Venn diagram of intersections using matplotlib; \
see :mod:`pybedtools.contrib.venn_maker` for more flexibility.

Numbers are placed on the diagram. If you don't have matplotlib installed,
try venn_gchart.py to use the Google Chart API instead.

The values in the diagram assume:

* unstranded intersections
* no features that are nested inside larger features
"""
import argparse
import sys
import os
import pybedtools

def venn_mpl(a, b, c, colors=None, outfn='out.png', labels=None):
    """
    *a*, *b*, and *c* are filenames to BED-like files.

    *colors* is a list of matplotlib colors for the Venn diagram circles.

    *outfn* is the resulting output file. This is passed directly to
    fig.savefig(), so you can supply extensions of .png, .pdf, or whatever your
    matplotlib installation supports.

    *labels* is a list of labels to use for each of the files; by default the
    labels are ['a','b','c']
    """
    try:
        import matplotlib.pyplot as plt
        from matplotlib.patches import Circle
    except ImportError:
        sys.stderr.write('matplotlib is required to make a Venn diagram with %s\n' % os.path.basename(sys.argv[0]))
        sys.exit(1)

    a = pybedtools.BedTool(a)
    b = pybedtools.BedTool(b)
    c = pybedtools.BedTool(c)

    if colors is None:
        colors = ['r','b','g']

    radius = 6.0
    center = 0.0
    offset = radius / 2

    if labels is None:
        labels = ['a','b','c']
Then my code:
rule venndiagramm_data:
    input:
        data = expand("bed_files/{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
    output:
        "figures/Venn_PR1_PR2_GUI_data.png"
    run:
        col = ['g','k','b']
        lab = ['PR1_data','PR2_data','GUI_data']
        venn_mpl(input.data[0], input.data[1], input.data[2], colors = col, labels = lab, outfn = output)
The error is:
SystemExit in line 62 of snakemake_generatingVennDiagramm.py:
1
The snakemake-log only gives me:
rule venndiagramm_data:
input: bed_files/A_peaks.narrowPeak,bed_files/B_peaks.narrowPeak, bed_files/C_peaks.narrowPeak
output: figures/Venn_PR1_PR2_GUI_data.png
jobid: 2
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I already tried to add as written in the documentation:
rule error:
    shell:
        """
        set +e
        somecommand ...
        exitcode=$?
        if [ $exitcode -eq 1 ]
        then
            exit 1
        else
            exit 0
        fi
        """
but this changed nothing.
Then my next idea was to just do it using the shell command, which I had also tested before and which worked perfectly. But then I got a different, though I think quite similar, error message for which I didn't find a proper solution either:
rule venndiagramm_data_shell:
    input:
        data = expand("bed_files/{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
    output:
        "figures/Venn_PR1_PR2_GUI_data.png"
    shell:
        "venn_mpl.py -a {input.data[0]} -b {input.data[1]} -c {input.data[2]} --color 'g,k,b' --labels 'PR1_data,PR2_data,GUI_data'"
The snakemake log:
[Thu May 23 16:37:27 2019]
rule venndiagramm_data_shell:
input: bed_files/A_peaks.narrowPeak, bed_files/B_peaks.narrowPeak, bed_files/C_peaks.narrowPeak
output: figures/Venn_PR1_PR2_GUI_data.png
jobid: 1
[Thu May 23 16:37:29 2019]
Error in rule venndiagramm_data_shell:
jobid: 1
output: figures/Venn_PR1_PR2_GUI_data.png
RuleException:
CalledProcessError in line 45 of snakemake_generatingVennDiagramm.py:
Command ' set -euo pipefail; venn_mpl.py -a input.data[0] -b input.data[1] -c input.data[2] --color 'g,k,b' --labels 'PR1_data,PR2_data,GUI_data' ' returned non-zero exit status 1.
Does anyone have an idea what the reason for this could be and how to fix it?
FYI: as I said, I tested it without running it through snakemake. This is my working code:
from snakemake.io import expand
import yaml
import pybedtools
from pybedtools.scripts.venn_mpl import venn_mpl

config_text_real = """
samples:
    data:
        - A
        - B
        - C
    control:
        - A_input
        - B_input
        - C_input
"""
config_vennDiagramm = yaml.load(config_text_real)
config = config_vennDiagramm

data = expand("{sample}_peaks.narrowPeak", sample=config["samples"]["data"])
col = ['g','k','b']
lab = ['PR1_data','PR2_data','GUI_data']
venn_mpl(data[0], data[1], data[2], colors = col, labels = lab, outfn = 'Venn_PR1_PR2_GUI_data.png')

control = expand("{sample}_peaks.narrowPeak", sample=config["samples"]["control"])
lab = ['PR1_control','PR2_control','GUI_control']
venn_mpl(control[0], control[1], control[2], colors = col, labels = lab, outfn = 'Venn_PR1_PR2_GUI_control.png')
and within my jupyter notebook for shell:
!A='../path/to/file/A_peaks.narrowPeak'
!B='../path/to/file/B_peaks.narrowPeak'
!C='../path/to/file/C_peaks.narrowPeak'
!col=g,k,b
!lab='PR1_data, PR2_data, GUI_data'
!venn_mpl.py -a ../path/to/file/A_peaks.narrowPeak -b ../path/to/file/B_peaks.narrowPeak -c ../path/to/file/C_peaks.narrowPeak --color "g,k,b" --labels "PR1_data, PR2_data, GUI_data"
The reason why I used the full path instead of the variable is that, for some reason, the code didn't work when calling the variable with "$A".
Not sure if this fixes it, but one thing I notice is that:
shell:
    "venn_mpl.py -a input.data[0] -b input.data[1] -c input.data[2]..."
probably should be:
shell:
    "venn_mpl.py -a {input.data[0]} -b {input.data[1]} -c {input.data[2]}..."

Display R plots running from Python

I have the following R script called Test.R:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(2,4,6,8,10,12,14,16,18,20)
plot(x,y, type="o")
x
y
I am running it via Python using this Python script called Test.py:
import subprocess

proc = subprocess.Popen(['Path/To/Rscript.exe',
                         'Path/To/Test.R'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
print stdout

# Alternative Code to see output
# retcode = subprocess.call(['Path/To/Rscript.exe',
#                            'Path/To/Test.R'])
When I run the python script Test.py, I get the following output in Pycharm:
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 4 6 8 10 12 14 16 18 20
So the usual text results show up fine, but how do I get the plots to show? I've tried changing the executable from Rscript.exe to Rgui.exe, but I get the following error and it only opens up Rgui:
ARGUMENT Path/To/Test.R __ignored__
Is there an easy way for the output to display? I know this is a simple problem, but I'm wondering how this will extend to other plot commands in R like acf() or pacf(). Should I use ggplot2 to save the plots and just tell Python to open the saved files?
Thanks.
Add:
show()
after:
plot(x,y, type="o")
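Alternatively, in line with the question's own suggestion, you could have Test.R save the plot to a file and open it from Python. This is only a sketch: the file name plot.png is an assumption, and Test.R would need to wrap the plot call in png("plot.png") ... dev.off():
import os
import subprocess

# run the R script, which is assumed to save its plot as plot.png
subprocess.call(['Path/To/Rscript.exe', 'Path/To/Test.R'])
# open the saved image with the default Windows viewer
os.startfile('plot.png')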

call python with system() in R to run a python script emulating the python console

I want to pass a chunk of Python code to Python in R with something like system('python ...'), and I'm wondering if there is an easy way to emulate the python console in this case. For example, suppose the code is "print 'hello world'", how can I get the output like this in R?
>>> print 'hello world'
hello world
This only shows the output:
> system("python -c 'print \"hello world\"'")
hello world
Thanks!
BTW, I asked in r-help but have not got a response yet (if I do, I'll post the answer here).
Do you mean something like this?
export NUM=10
R -q -e "rnorm($NUM)"
You might also like to check out littler - http://dirk.eddelbuettel.com/code/littler.html
UPDATED
Following your comment below, I think I am beginning to understand your question better. You are asking about running python inside the R shell.
So here's an example:-
# code in a file named myfirstpythonfile.py
a = 1
b = 19
c = 3
mylist = [a, b, c]
for item in mylist:
    print item
In your R shell, therefore, do this:
> system('python myfirstpythonfile.py')
1
19
3
Essentially, you can simply call python /path/to/your/python/file.py to execute a block of python code.
In my case, I can simply call python myfirstpythonfile.py assuming that I launched my R shell in the same directory (path) my python file resides.
FURTHER UPDATED
And if you really want to print out the source code, here's a brute force method that might be possible. In your R shell:-
> system('python -c "import sys; sys.stdout.write(file(\'myfirstpythonfile.py\', \'r\').read());"; python myfirstpythonfile.py')
a = 1
b = 19
c = 3
mylist = [a, b, c]
for item in mylist:
print item
1
19
3
AND FURTHER FURTHER UPDATED :-)
So if the purpose is to print each line of python code before it is executed, we can use the python trace module (reference: http://docs.python.org/library/trace.html). On the command line, we use the -m option to call a python module, and we specify the options for that module after it.
So for my example above, it would be:-
$ python -m trace --trace myfirstpythonfile.py
--- modulename: myfirstpythonfile, funcname: <module>
myfirstpythonfile.py(1): a = 1
myfirstpythonfile.py(2): b = 19
myfirstpythonfile.py(3): c = 3
myfirstpythonfile.py(4): mylist = [a, b, c]
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
1
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
19
myfirstpythonfile.py(5): for item in mylist:
myfirstpythonfile.py(6): print item
3
myfirstpythonfile.py(5): for item in mylist:
--- modulename: trace, funcname: _unsettrace
trace.py(80): sys.settrace(None)
Which, as we can see, traces the exact line of python code, executes it immediately afterwards, and sends the output to stdout.
The system command has an option called intern = FALSE. Make this TRUE and whatever output was visible before will be stored in a variable.
Now run your system command with this option and you should get your output directly in your variable, like this:
tmp <- system("python -c 'print \"hello world\"'",intern=T)
My workaround for this problem is defining my own functions that paste in parameters, write out a temporary .py file, and then execute the python file via a system call. Here is an example that calls ArcGIS's Euclidean Distance function:
py.EucDistance = function(poly_path, poly_name, snap_raster, out_raster_path_name, maximum_distance, mask){
  py_path = 'G:/Faculty/Mann/EucDistance_temp.py'
  poly_path_name = paste(poly_path, poly_name, sep='')
  fileConn <- file(paste(py_path))
  writeLines(c(
    paste('import arcpy'),
    paste('from arcpy import env'),
    paste('from arcpy.sa import *'),
    paste('arcpy.CheckOutExtension("spatial")'),
    paste('out_raster_path_name = "',out_raster_path_name,'"',sep=""),
    paste('snap_raster = "',snap_raster,'"',sep=""),
    paste('cellsize =arcpy.GetRasterProperties_management(snap_raster,"CELLSIZEX")'),
    paste('mask = "',mask,'"',sep=""),
    paste('maximum_distance = "',maximum_distance,'"',sep=""),
    paste('sr = arcpy.Describe(snap_raster).spatialReference'),
    paste('arcpy.env.overwriteOutput = True'),
    paste('arcpy.env.snapRaster = "',snap_raster,'"',sep=""),
    paste('arcpy.env.mask = mask'),
    paste('arcpy.env.scratchWorkspace ="G:/Faculty/Mann/Historic_BCM/Aggregated1080/Scratch.gdb"'),
    paste('arcpy.env.outputCoordinateSystem = sr'),
    # get spatial reference for raster and force output to that
    paste('sr = arcpy.Describe(snap_raster).spatialReference'),
    paste('py_projection = sr.exportToString()'),
    paste('arcpy.env.extent = snap_raster'),
    paste('poly_name = "',poly_name,'"',sep=""),
    paste('poly_path_name = "',poly_path_name,'"',sep=""),
    paste('holder = EucDistance(poly_path_name, maximum_distance, cellsize, "")'),
    paste('holder = SetNull(holder < -9999, holder)'),
    paste('holder.save(out_raster_path_name) ')
  ), fileConn, sep = "\n")
  close(fileConn)
  system(paste('C:\\Python27\\ArcGIS10.1\\python.exe', py_path))
}
