Downloading files from Google Storage using Spark (Python) and Dataproc - python

I have an application that parallelizes the execution of Python objects that process data to be downloaded from Google Storage (my project bucket). The cluster is created using Google Dataproc. The problem is that the data is never downloaded! I wrote a test program to try and understand the problem.
I wrote the following function to copy the files from the bucket and to see if creating files on workers does work:
from subprocess import call
from os.path import join
def copyDataFromBucket(filename, remoteFolder, localFolder):
    call(["gsutil", "-m", "cp", join(remoteFolder, filename), localFolder])

def execTouch(filename, localFolder):
    call(["touch", join(localFolder, "touched_" + filename)])
I've tested this function by calling it from a python shell and it works. But when I run the following code using spark-submit, the files are not downloaded (but no error is raised):
# ...
filesRDD = sc.parallelize(fileList)
filesRDD.foreach(lambda myFile: copyDataFromBucket(myFile, remoteBucketFolder, '/tmp/output'))
filesRDD.foreach(lambda myFile: execTouch(myFile, '/tmp/output'))
# ...
The execTouch function works (I can see the files on each worker) but the copyDataFromBucket function does nothing.
So what am I doing wrong?

The problem was clearly the Spark context. Replacing the call to "gsutil" with a call to "hadoop fs" solves it:
from subprocess import call
from os.path import join
def copyDataFromBucket(filename, remoteFolder, localFolder):
    call(["hadoop", "fs", "-copyToLocal", join(remoteFolder, filename), localFolder])
I also did a test sending data to the bucket: one only needs to replace "-copyToLocal" with "-copyFromLocal".
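For reference, a minimal sketch of the working test put together from the snippets above (my own assembly; sc, fileList and remoteBucketFolder are the same placeholders as in the question):

from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
    # hadoop fs goes through the cluster's GCS connector, which is available on the executors
    call(["hadoop", "fs", "-copyToLocal", join(remoteFolder, filename), localFolder])

def copyDataToBucket(filename, localFolder, remoteFolder):
    # same idea in the other direction
    call(["hadoop", "fs", "-copyFromLocal", join(localFolder, filename), remoteFolder])

filesRDD = sc.parallelize(fileList)
filesRDD.foreach(lambda myFile: copyDataFromBucket(myFile, remoteBucketFolder, '/tmp/output'))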

Related

How to print to terminal like git log?

I'm writing a simple CLI application using python.
I have a list of records that I want to print in the terminal, and I would like to output them just like git log does: a partial list is shown, you can load more using the down arrow, and you quit by typing "q" (basically like the output of less, but without opening and closing a file).
How does git log do that?
Piping directly to a pager, as in this answer, should work.
Alternatively, you can use a temporary file:
import os
import tempfile
import subprocess
# File contents for testing, replace with whatever
file = '\n'.join(f"{i} abc 123"*15 for i in range(400))
# Save the file to the disk
with tempfile.NamedTemporaryFile('w+', delete=False) as f:
    f.write(file)

# Run `less` on the saved file
subprocess.check_call(["less", f.name])

# Delete the temporary file now that we are done with it.
os.unlink(f.name)
The device you are looking for is called a pager. There is a pipepager function inside pydoc; it is not documented in the linked pydoc docs, but using an interactive Python console you can learn that:
>>> help(pydoc.pipepager)
Help on function pipepager in module pydoc:
pipepager(text, cmd)
Page through text by feeding it to another program.
Therefore it seems you should use it as follows:
import pydoc
pydoc.pipepager("your text here", "less")
with the limitation that it depends on the availability of the less command.
How does git log do that?
git log invokes less when the output will not fit on the terminal. You can check that by running git log (if the repo doesn't have a lot of commits, you can just resize the terminal before running the command) and then checking the running processes with ps aux | grep less.
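Putting the pieces together, a minimal sketch (my own, not from the answers above) that pages only when the text is taller than the terminal, mimicking git log; it assumes less is available on the PATH:

import shutil
import pydoc

def page_records(records):
    # Print directly if everything fits on screen; otherwise pipe through less.
    text = "\n".join(records)
    if text.count("\n") + 1 < shutil.get_terminal_size().lines:
        print(text)
    else:
        # -R keeps ANSI colours, -F makes less exit immediately if the text fits after all
        pydoc.pipepager(text, "less -RF")

page_records(f"record {i}" for i in range(400))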

Nextflow - Channel.watchPath() method

I am trying to use Nextflow to gain some concurrency for my Python scripts, so some of my dataflow doesn't follow the traditional Nextflow idioms.
In my first process I create files by invoking a Python script; in my second process I want to use those created files.
I created a new channel that watches the path where the files are created, but nothing seems to happen. I tested with the .fromPath method and my process is successful, so I am not sure what's going wrong.
mutFiles = Channel.watchPath(launchDir + '/output/mutFiles/*.mutfile')

process structurePrediction {
    input:
    file mutFiles

    output:
    stdout results

    """
    test.py ${mutFiles}
    """
}

Python os.listdir() doesn't return some files

I have a Python function that checks the modified time of a folder using os.stat() and then does an os.listdir() of that folder.
I have another process which creates files in the folder being checked.
Sometimes it is observed that three files are created with the same timestamp, and the folder's stat also has that same timestamp.
When the Python function fetches the files in the same millisecond, it is observed that os.listdir() returns only 2 of the 3 created files.
Why is this so ?
The environment is :
OS: Red Hat Enterprise Linux Server release 7.6
Python Version: Python 3.6.8
Filesystem : xfs
Sample code
import os
import sys
import time
filelist = list()
last_mtime = None

def walkover():
    path_to_check = "/path/to/check"
    curr_mtime = os.stat(path_to_check).st_mtime_ns
    global last_mtime
    global filelist
    if last_mtime is None or curr_mtime > last_mtime:
        for file in os.listdir(path_to_check):
            if file not in filelist:
                filelist.append(file)
        last_mtime = curr_mtime
        print("{} modified at {}".format(path_to_check, last_mtime))
The function is invoked to maintain a list of files at a point in time.
The if case is present to avoid multiple os.listdir() invocations.
Edit:
The files are ".rsp" files which get created by ninja when a ".o" is about to get built.
Since my machine has multiple cores (16), ninja is triggered from cmake with "--parallel 16". This causes 16 compilations to happen in parallel.
I don't know this ninja thing, but it seems to me that when writing the first ".rsp" file and running the walkover function happen at approximately the same time, the walkover function may miss the creation of the other files. Whether it's buffers or queues or whatever, specifically when parallel processes are in play the order of events can be hard to grasp.
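As a minimal sketch (my own, not part of the answer) of one way to sidestep the race: drop the mtime shortcut entirely and diff directory snapshots with a set, so the directory is re-listed on every call and files created within the same timestamp are never skipped:

import os

seen = set()

def walkover(path_to_check="/path/to/check"):
    # List the directory every time and track names in a set,
    # trading a little extra work for correctness.
    new_files = set(os.listdir(path_to_check)) - seen
    seen.update(new_files)
    return sorted(new_files)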

Adding a JAR file to Python Script

I am trying to use a JAR file and import its functionality into my Python script. The jar file is located in the same directory as my Python script and Pig script.
script.py
import sys
sys.path.append('/home/hadoop/scripts/jyson-1.0.2.jar')
from com.xhaus.jyson import JysonCodec as json
@outputSchema('output_field_name:chararray')
def get_team(arg0):
    return json.loads(arg0)
script.pig
register 'script.py' using jython as script_udf;
a = LOAD 'data.json' USING PigStorage('*') as (line:chararray);
teams = FOREACH a GENERATE script_udf.get_team(line);
dump teams;
It is a very simple UDF that I am trying to use, but for some reason I always get an error saying "No module named xhaus". Here are all the classes in that jar.
$ jar tf jyson-1.0.2.jar
META-INF/
META-INF/MANIFEST.MF
com/
com/xhaus/
com/xhaus/jyson/
com/xhaus/jyson/JSONDecodeError.class
com/xhaus/jyson/JSONEncodeError.class
com/xhaus/jyson/JSONError.class
com/xhaus/jyson/JysonCodec.class
com/xhaus/jyson/JysonDecoder.class
com/xhaus/jyson/JysonEncoder.class
So xhaus exists in the jar, but for some reason this is not being picked up. When I look at a few tutorials, they are able to run these scripts fine. I might be missing a silly detail; please help.
EDIT:
This script is executed by pig. So the pig script calls the python script. And the python script uses the JysonCodec class.
pig script.pig
If you are running this script in Pig MapReduce mode, you need to make the jar available at job runtime. At the top of your Pig script, add the following line:
REGISTER /home/hadoop/scripts/jyson-1.0.2.jar;
Then comment out sys.path.append('/home/hadoop/scripts/jyson-1.0.2.jar')
in your UDF script. The classes from the jar will already be available to the UDF since you have registered it with the Pig script, so there is no need to change sys.path.
Hope it helps.
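Put together, the revised script.py would then look roughly like this (a sketch based on the files in the question; the Pig script keeps its register 'script.py' using jython line and gains the REGISTER line above):

# script.py -- no sys.path manipulation needed once the jar is REGISTERed
from com.xhaus.jyson import JysonCodec as json

@outputSchema('output_field_name:chararray')
def get_team(arg0):
    return json.loads(arg0)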

Calling a jar file from Python using JPype-total newbie query

So I have been using subprocess.call to run a jar file from Python like so:
subprocess.call(['java', '-jar', 'jarFile.jar', '-a', 'input_file', 'output_file'])
where it writes the result to an external output_file, and -a is an option.
I now want to analyse output_file in python but want to avoid opening the file again. So I want to run jarFile.jar as a Python function, like:
output=jarFile(input_file)
I have installed JPype and got it working; I have set the class path and started the JVM environment:
import jpype
classpath="/home/me/folder/jarFile.jar"
jpype.startJVM(jpype.getDefaultJVMPath(),"-Djava.class.path=%s"%classpath)
and am now stuck...
java -jar jarFile.jar executes the main method of a class file that is configured in the jar's manifest file.
You can find that class name by extracting the jar file's META-INF/MANIFEST.MF (open the jar with any zip tool) and looking for the value of Main-Class. If that's, for instance, com.foo.bar.Application, you should be able to call the main method like this:
import tempfile

def jarFile(input_file):
    # jpype is started as you already did
    assert jpype.isJVMStarted()
    tf = tempfile.NamedTemporaryFile()
    jpype.com.foo.bar.Application.main(['-a', input_file, tf.name])
    return tf
(I'm not sure about the correct use of the tempfile module, please check yourself)
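For what it's worth, a usage sketch under the same assumptions (com.foo.bar.Application is still the hypothetical main class; jpype.JPackage is another common way to reach a Java package in older JPype versions):

import tempfile
import jpype

classpath = "/home/me/folder/jarFile.jar"
jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % classpath)

def jarFile(input_file):
    # Let the jar write to a temporary file, then read the result back in Python
    tf = tempfile.NamedTemporaryFile()
    jpype.JPackage("com").foo.bar.Application.main(["-a", input_file, tf.name])
    tf.seek(0)
    return tf.read()

output = jarFile("input_file")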
