Calling a jar file from Python using JPype - total newbie query

So I have been using subprocess.call to run a jar file from Python like so:
subprocess.call(['java', '-jar', 'jarFile.jar', '-a', 'input_file', 'output_file'])
which writes the result to an external file, output_file, where -a is an option.
I now want to analyse output_file in python but want to avoid opening the file again. So I want to run jarFile.jar as a Python function, like:
output=jarFile(input_file)
I have installed JPype and got it working; I have set the classpath and started the JVM:
import jpype
classpath="/home/me/folder/jarFile.jar"
jpype.startJVM(jpype.getDefaultJVMPath(),"-Djava.class.path=%s"%classpath)
and am now stuck...

java -jar jarFile.jar executes the main method of the class configured in the jar's manifest file.
You can find that class name by extracting the jar's META-INF/MANIFEST.MF (open the jar with any zip tool) and looking for the value of Main-Class. If that is, for instance, com.foo.bar.Application, you should be able to call the main method like this:
import tempfile
import jpype

def jarFile(input_file):
    # jpype is started as you already did
    assert jpype.isJVMStarted()
    # let the jar write its output to a temporary file instead of a fixed output_file
    tf = tempfile.NamedTemporaryFile()
    jpype.com.foo.bar.Application.main(['-a', input_file, tf.name])
    return tf
(I'm not sure about the correct use of the tempfile module, please check yourself)
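If the attribute-style lookup (jpype.com.foo.bar.Application) does not resolve in your JPype version, jpype.JClass is an alternative. A minimal sketch, still assuming the example Main-Class com.foo.bar.Application:
import jpype

def run_jar_main(input_file, output_file):
    # Look the Main-Class up explicitly and call its main(String[])
    # with an explicit Java string array.
    Application = jpype.JClass("com.foo.bar.Application")
    args = jpype.JArray(jpype.JString)(['-a', input_file, output_file])
    Application.main(args)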

Related

How to print to terminal like git log?

I'm writing a simple CLI application using Python.
I have a list of records that I want to print in the terminal, and I would like to output them just like git log does: a partial list is shown, you can load more using the down arrow, and you quit by typing "q" (basically like the output of less, but without opening and closing a file).
How does git log do that?
You can pipe directly to a pager, as another answer here suggests.
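For illustration, a minimal sketch of that direct-pipe approach using subprocess (it assumes less is on the PATH; -R keeps ANSI colours, roughly what git does):
import subprocess

def page(text):
    # Feed the text to `less` through its stdin instead of going via a file.
    pager = subprocess.Popen(["less", "-R"], stdin=subprocess.PIPE)
    pager.communicate(text.encode())

page("\n".join(f"record {i}: abc 123" for i in range(400)))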
Alternatively, you can use a temporary file:
import os
import tempfile
import subprocess
# File contents for testing, replace with whatever
file = '\n'.join(f"{i} abc 123"*15 for i in range(400))
# Save the file to the disk
with tempfile.NamedTemporaryFile('w+', delete=False) as f:
    f.write(file)
# Run `less` on the saved file
subprocess.check_call(["less", f.name])
# Delete the temporary file now that we are done with it.
os.unlink(f.name)
The device you are looking for is called a pager. There is a pipepager function inside the pydoc module which is not documented in the linked pydoc docs, but using the interactive Python console you can learn that
>>> help(pydoc.pipepager)
Help on function pipepager in module pydoc:
pipepager(text, cmd)
    Page through text by feeding it to another program.
therefore it seems that you should use it as follows:
import pydoc
pydoc.pipepager("your text here", "less")
with the limitation that it depends on the availability of the less command.
How does git log do that?
git log invokes less when the output will not fit on the terminal. You can check that by running git log (if the repo doesn't have a lot of commits, you can just resize the terminal before running the command) and then checking the running processes like so: ps aux | grep less

Question about calling a Python script from Jenkins and returning a value to the pipeline

I tried to invoke a Python script that I wrote in PyCharm from a Groovy pipeline. The Python code runs and finds the relevant folder name as a string, which I need to get back in Groovy as a variable that I can put into a path in order to run a bat file in that folder.
This is my Jenkins pipeline stage that calls the Python script:
stage('Most Updated Version') {
    steps {
        script {
            echo '------------------------------- RUN PYTHON SCRIPT -------------------------------'
            def returnedVersion = bat """${python_27} "C:\\Desktop\\updatedDirectory.py" """
            echo '------------------------------- PYTHON SCRIPT EXECUTED -------------------------------'
            echo '------------------------------- ADD VERSION NUMBER TO PATH -------------------------------'
            bat "call C:\\component\\returnedVersion\\Run_program.bat"
        }
    }
}
I tried to store the result in the returnedVersion variable and then put returnedVersion in the relevant place in that path, but this doesn't work. Is this the right way to do it, or is there a better way? (I'm new to Groovy.)
This is my Python script. Do I need to return or add anything specific there?
import os
import glob

def receiveMostUpdatedFolderNumber():
    list_of_files = glob.glob('C:\component/[0-9]*')
    latest_file = max(list_of_files, key=os.path.getmtime)
    version = latest_file[11:]
    print(version)
    return version

res = receiveMostUpdatedFolderNumber()
I thought about storing version in a .txt file and then trying to open it from Jenkins. Is that a good way?
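For what it's worth, a minimal sketch of that .txt idea on the Python side; the file name version.txt is an assumption, and the pipeline would then read it back (for example with the readFile step) before building the bat path:
import os
import glob

def receiveMostUpdatedFolderNumber():
    # Same pattern as in the question, written with forward slashes;
    # os.path.basename avoids the hard-coded [11:] slice.
    list_of_files = glob.glob('C:/component/[0-9]*')
    latest_file = max(list_of_files, key=os.path.getmtime)
    return os.path.basename(latest_file)

# Write only the version string, so the Jenkins job can pick it up as-is.
with open('version.txt', 'w') as out:
    out.write(receiveMostUpdatedFolderNumber())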

Downloading files from Google Storage using Spark (Python) and Dataproc

I have an application that parallelizes the execution of Python objects that process data to be downloaded from Google Storage (my project bucket). The cluster is created using Google Dataproc. The problem is that the data is never downloaded! I wrote a test program to try and understand the problem.
I wrote the following function to copy the files from the bucket and to see if creating files on workers does work:
from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
    call(["gsutil", "-m", "cp", join(remoteFolder, filename), localFolder])

def execTouch(filename, localFolder):
    call(["touch", join(localFolder, "touched_" + filename)])
I've tested this function by calling it from a python shell and it works. But when I run the following code using spark-submit, the files are not downloaded (but no error is raised):
# ...
filesRDD = sc.parallelize(fileList)
filesRDD.foreach(lambda myFile: copyDataFromBucket(myFile, remoteBucketFolder, '/tmp/output'))
filesRDD.foreach(lambda myFile: execTouch(myFile, '/tmp/output'))
# ...
The execTouch function works (I can see the files on each worker) but the copyDataFromBucket function does nothing.
So what am I doing wrong?
The problem was clearly the Spark context. Replacing the call to "gsutil" with a call to "hadoop fs" solves it:
from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
    call(["hadoop", "fs", "-copyToLocal", join(remoteFolder, filename), localFolder])
I also did a test sending data to the bucket; one only needs to replace "-copyToLocal" with "-copyFromLocal".
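For reference, a minimal sketch of that reverse direction (the function name copyDataToBucket is mine, not from the original code):
from subprocess import call
from os.path import join

def copyDataToBucket(filename, localFolder, remoteFolder):
    # Same pattern as above, but uploading from the worker's local disk
    # back to the bucket-backed folder.
    call(["hadoop", "fs", "-copyFromLocal", join(localFolder, filename), remoteFolder])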

How to access a file from python when the OS opens the script to handle opening that file?

When I open an HTML file, for instance, I have it set such that it opens in Chrome. Now if I set a given python script to be the thing that opens a given filetype, how do I access this file in the script? Where is it available from?
When opening a file, the operating system starts the responsible opener program and passes the file(s) to be opened as command line arguments:
path/to/opener_executable path/to/file1_to_be_opened path/to/file2_to_be_opened ...
You can access the command line arguments through sys.argv in your python script. A minimal example:
import sys
print("I'm supposed to open the following file(s):")
print('\n'.join(sys.argv[1:]))
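Extending that minimal example slightly, a sketch that actually opens the first file the OS passed in (it assumes at least one path was given and that the file is plain text):
import sys

if len(sys.argv) > 1:
    # The OS appended the opened file's path as the first argument.
    with open(sys.argv[1]) as f:
        print(f.read())
else:
    print("No file was passed on the command line.")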
To prove Rawing's point: on Linux, you can "Open With Other Application" and select your Python script, which you have made executable.
sys.argv provides the name of the script as argument 0 and thereafter a list of any other parameters.
myopener.py
#!/usr/bin/env python
import sys, os

# Record whatever arguments the OS passed to this opener script.
x = os.open('/home/rolf/myopener.output', os.O_RDWR | os.O_CREAT)
xx = os.fdopen(x, 'w+')
y = str(sys.argv)
xx.write(y)
xx.close()
Opening the file abc.ddd with myopener.py creates the file myopener.output with the contents:
['/home/rolf/myopener.py', '/home/rolf/abc.ddd']

Adding a JAR file to Python Script

I am trying to use a JAR file and import its functionality into my Python script. The jar file is located in the same directory as my Python script and Pig script.
script.py
import sys
sys.path.append('/home/hadoop/scripts/jyson-1.0.2.jar')
from com.xhaus.jyson import JysonCodec as json

@outputSchema('output_field_name:chararray')
def get_team(arg0):
    return json.loads(arg0)
script.pig
register 'script.py' using jython as script_udf;
a = LOAD 'data.json' USING PigStorage('*') as (line:chararray);
teams = FOREACH a GENERATE script_udf.get_team(line);
dump teams;
It is a very simple UDF that I am trying to use, but for some reason I always get an error saying "No module named xhaus". Here are all the classes in that jar.
$ jar tf jyson-1.0.2.jar
META-INF/
META-INF/MANIFEST.MF
com/
com/xhaus/
com/xhaus/jyson/
com/xhaus/jyson/JSONDecodeError.class
com/xhaus/jyson/JSONEncodeError.class
com/xhaus/jyson/JSONError.class
com/xhaus/jyson/JysonCodec.class
com/xhaus/jyson/JysonDecoder.class
com/xhaus/jyson/JysonEncoder.class
So xhaus exists in the jar, but for some reason this is not being picked up. When I look at a few tutorials, they are able to run these scripts fine. I might be missing a silly detail, please help.
EDIT:
This script is executed by pig. So the pig script calls the python script. And the python script uses the JysonCodec class.
pig script.pig
If you are running this script in Pig MapReduce mode, you need to make the jar available at job runtime. At the top of your Pig script, add the following line:
REGISTER /home/hadoop/scripts/jyson-1.0.2.jar;
Then comment out sys.path.append('/home/hadoop/scripts/jyson-1.0.2.jar') in your UDF script. The classes from the jar will already be available to the UDF since you have registered it with the Pig script, so there is no need to change sys.path.
Hope it helps.
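To illustrate, with the REGISTER line in place the UDF script would shrink to roughly this (a sketch, assuming Pig's Jython engine provides the outputSchema decorator as in its documentation examples):
# script.py - the jar is supplied by REGISTER in script.pig,
# so no sys.path manipulation is needed here.
from com.xhaus.jyson import JysonCodec as json

@outputSchema('output_field_name:chararray')
def get_team(arg0):
    return json.loads(arg0)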
