Google Cloud Platform Vertex AI logs not showing in custom job - python

I have written a python package that trains a neural network. I then package it up using the below command.
python3 setup.py sdist --formats=gztar
When I run this job through the GCP console and manually click through all the options, I get logs from my program as expected (see example below).
Example successful logs:
However, when I run the exact same job programmatically, no logs appear, only the final error (if one occurs):
Example logs missing:
In both cases, the program is running; I just can't see any of its output. What could the reason for this be? For reference, I have also included the code I used to programmatically start the training process:
import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform

ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.strftime(datetime.datetime.now(),"%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = [f"/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = "projects/our_gcp_project_name/locations/europe-west4/tensorboards/yaw_correction"
MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False
# Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)

# Copy distribution to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = [x for x in Path('dist').glob('*')]
if len(filename) != 1:
    raise Exception("More than one distribution was found")
print(str(filename[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filename[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()
# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)

# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'PyMySQL==1.0.2'],
    container_uri=CONTAINER_URI,
)
model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    #tensorboard=TENSORBOARD,
    sync=SYNC
)
print(model)
print("JOB SUBMITTED")

This kind of error ("The replica workerpool0-0 exited with a non-zero status of 1") usually means that something went wrong while packaging the Python files, or in the code itself.
You could check the following options.
You could check that all the files (training files and dependencies) are included in the package, as in this
example (a quick way to inspect the built archive is sketched after the listing):
setup.py
demo/PKG
demo/SOURCES.txt
demo/dependency_links.txt
demo/requires.txt
demo/level.txt
trainer/__init__.py
trainer/metadata.py
trainer/model.py
trainer/task.py
trainer/utils.py
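If it helps, here is a minimal sketch (not from the original answer) for listing what actually ended up inside the built sdist before uploading it; the dist/ path and *.tar.gz pattern are assumptions based on the setup.py command shown in the question:
import tarfile
from pathlib import Path

# Assumes a single gztar sdist was produced by `python3 setup.py sdist --formats=gztar`.
archive = next(Path('dist').glob('*.tar.gz'))
with tarfile.open(archive, 'r:gz') as tar:
    for member in tar.getnames():
        print(member)  # every training module and data file should be listed here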
You could also check the official Google Cloud troubleshooting guide for this type of error, which explains how to get more information about it.
You can also see the official documentation about packaging.
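As an additional, hedged sketch (not part of the original answer): the job's output should still reach Cloud Logging even when the submitting script is detached, so you can pull the entries with the google-cloud-logging client. The filter string below is an assumption and may need adjusting for your job's resource type and ID:
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project='our_gcp_project_name')
# Assumed filter; check the custom job's resource type and job ID in the Logs Explorer.
log_filter = 'resource.type="ml_job"'
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    print(entry.payload)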

Related

Calling gcloud from bazel genrule

I am having some issues getting gcloud to run in a Bazel genrule; it looks like Python path-related issues.
genrule(
    name = "foo",
    outs = ["bar"],
    srcs = [":bar.enc"],
    cmd = "gcloud decrypt --location=global --keyring=foo --key=bar --plaintext-file $@ --ciphertext-file $(location bar.enc)",
)
The exception is:
ImportError: No module named traceback
From:
try:
  gcloud_main = _import_gcloud_main()
except Exception as err:  # pylint: disable=broad-except
  # We want to catch *everything* here to display a nice message to the user
  # pylint:disable=g-import-not-at-top
  import traceback
  # We DON'T want to suggest `gcloud components reinstall` here (ex. as
  # opposed to the similar message in gcloud_main.py), as we know that no
  # commands will work.
  sys.stderr.write(
      ('ERROR: gcloud failed to load: {0}\n{1}\n\n'
       'This usually indicates corruption in your gcloud installation or '
       'problems with your Python interpreter.\n\n'
       'Please verify that the following is the path to a working Python 2.7 '
       'executable:\n'
       ' {2}\n\n'
       'If it is not, please set the CLOUDSDK_PYTHON environment variable to '
       'point to a working Python 2.7 executable.\n\n'
       'If you are still experiencing problems, please reinstall the Cloud '
       'SDK using the instructions here:\n'
       ' https://cloud.google.com/sdk/\n').format(
          err,
          '\n'.join(traceback.format_exc().splitlines()[2::2]),
          sys.executable))
  sys.exit(1)
My questions are:
How do I best call gcloud from a genrule?
What are the parameters needed to specify the python path?
How is Bazel blocking this?
Update:
Able to get it to run by specifying the CLOUDSDK_PYTHON.
Indeed, Bazel runs in a sandbox, hence gcloud cannot find its dependencies. Actually, I'm surprised gcloud can be invoked at all.
To proceed, I would wrap gcloud in a Bazel py_binary and refer to it with the tools attribute in the genrule. You also need to wrap it with $(location ...) in the cmd. In the end, you will have:
genrule(
    name = "foo",
    outs = ["bar"],
    srcs = [":bar.enc"],
    cmd = "$(location //third_party/google/gcloud) decrypt --location=global --keyring=foo --key=bar --plaintext-file $@ --ciphertext-file $(location bar.enc)",
    tools = ["//third_party/google/gcloud"],
)
And for that you define in third_party/google/gcloud/BUILD (or anywhere you want, I just used a path that makes sense to me):
py_binary(
    name = "gcloud",
    srcs = ["gcloud.py"],
    main = "gcloud.py",
    visibility = ["//visibility:public"],
    deps = [
        ":gcloud_sdk",
    ],
)

py_library(
    name = "gcloud_sdk",
    srcs = glob(
        ["**/*.py"],
        exclude = ["gcloud.py"],
        # maybe exclude tests and unrelated code, too.
    ),
    deps = [
        # Whatever extra deps are needed by gcloud to compile
    ],
)
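The py_binary above points at a gcloud.py that is not shown in the answer; as an assumption (modelled on the thin wrapper the Cloud SDK itself ships in its lib/ directory), it could be as small as:
# gcloud.py - hypothetical entry point for the :gcloud py_binary above.
# Assumes the SDK sources globbed into :gcloud_sdk expose googlecloudsdk.gcloud_main;
# verify the import path against your vendored copy of the Cloud SDK.
from googlecloudsdk import gcloud_main

if __name__ == '__main__':
    gcloud_main.main()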
I had a similar issue; running this command worked for me:
export CLOUDSDK_PYTHON=/usr/bin/python
(This was mentioned above as an update, but I felt it was worth posting the whole command for future people coming here.)

Running xlwings on a server

Solved the problem by making sure that python27.dll and win32api were copied into the PYTHON_WIN directory. The two files marked with an arrow (in the original screenshot) were missing in the server folder (P:/).
I am using the udf module of xlwings and pointing the xlwings.bas configuration at a server. (I have my personal disk C:/, however other users need to run this macro and the file must be stored in P:/.)
I have Python, xlwings and other modules installed in P:/, but I get a strange error about importing the os lib. Has anyone ever run into this?
Thanks in advance.
I got this from python:
"Python process exited before it was possible to create the interface object.
Command: P:\Python 27\python.exe -B -c ""import sys, os;sys.path.extend(os.path.normcase(os.path.expandvars(r'P:\Risco\Python Scripts')).split(';'));import xlwings.server; xlwings.server.serve('{2bb649ad-487e-48d5-ab31-24adaeefca59}')""
Working Dir: "
Xlwings.bas is set like this:
PYTHON_WIN = "P:\Python 27\python.exe"
PYTHON_MAC = ""
PYTHON_FROZEN = ThisWorkbook.Path & "\build\exe.win32-2.7"
PYTHONPATH = "P:\Risco\Python Scripts"
UDF_MODULES = "FetchPL"
UDF_DEBUG_SERVER = False
LOG_FILE = ""
SHOW_LOG = True
OPTIMIZED_CONNECTION = False

Link Waf target to a library generated by external build system (CMake)

My waf project has two dependencies, built with CMake.
What I'm trying to do, following the dynamic_build3 example found in the waf git repo, is create a tool which spawns CMake and, after a successful build, performs an install into waf's output subdirectory:
@extension('.txt')
def spawn_cmake(self, node):
    if node.name == 'CMakeLists.txt':
        self.cmake_task = self.create_task('CMake', node)
        self.cmake_task.name = self.target

@feature('cmake')
@after_method('process_source')
def update_outputs(self):
    self.cmake_task.add_target()

class CMake(Task.Task):
    color = 'PINK'

    def keyword(self):
        return 'CMake'

    def run(self):
        lists_file = self.generator.source[0]
        bld_dir = self.generator.bld.bldnode.make_node(self.name)
        bld_dir.mkdir()
        # process args and append install prefix
        try:
            cmake_args = self.generator.cmake_args
        except AttributeError:
            cmake_args = []
        cmake_args.append(
            '-DCMAKE_INSTALL_PREFIX={}'.format(bld_dir.abspath()))
        # execute CMake
        cmd = '{cmake} {args} {project_dir}'.format(
            cmake=self.env.get_flat('CMAKE'),
            args=' '.join(cmake_args),
            project_dir=lists_file.parent.abspath())
        try:
            self.generator.bld.cmd_and_log(
                cmd, cwd=bld_dir.abspath(), quiet=Context.BOTH)
        except WafError as err:
            return err.stderr
        # execute make install
        try:
            self.generator.bld.cmd_and_log(
                'make install', cwd=bld_dir.abspath(), quiet=Context.BOTH)
        except WafError as err:
            return err.stderr
        try:
            os.stat(self.outputs[0].abspath())
        except:
            return 'library {} does not exist'.format(self.outputs[0])
        # store the signature of the generated library to avoid re-running the
        # task without need
        self.generator.bld.raw_deps[self.uid()] = [self.signature()] + self.outputs

    def add_target(self):
        # override the outputs with the library file name
        name = self.name
        bld_dir = self.generator.bld.bldnode.make_node(name)
        lib_file = bld_dir.find_or_declare('lib/{}'.format(
            (
                self.env.cshlib_PATTERN
                if self.generator.lib_type == 'shared' else self.env.cstlib_PATTERN
            ) % name))
        self.set_outputs(lib_file)

    def runnable_status(self):
        ret = super(CMake, self).runnable_status()
        try:
            lst = self.generator.bld.raw_deps[self.uid()]
            if lst[0] != self.signature():
                raise Exception
            os.stat(lst[1].abspath())
            return Task.SKIP_ME
        except:
            return Task.RUN_ME
        return ret
I'd like to spawn the tool and then link the waf target to the installed libraries, which I perform using the "fake library" mechanism by calling bld.read_shlib():
def build(bld):
    bld.post_mode = Build.POST_LAZY
    # build 3rd-party CMake dependencies first
    for lists_file in bld.env.CMAKE_LISTS:
        if 'Chipmunk2D' in lists_file:
            bld(
                source=lists_file,
                features='cmake',
                target='chipmunk',
                lib_type='shared',
                cmake_args=[
                    '-DBUILD_DEMOS=OFF',
                    '-DINSTALL_DEMOS=OFF',
                    '-DBUILD_SHARED=ON',
                    '-DBUILD_STATIC=OFF',
                    '-DINSTALL_STATIC=OFF',
                    '-Wno-dev',
                ])
    bld.add_group()
    # after this, specifying `use=['chipmunk']` in the target does the job
    out_dir = bld.bldnode.make_node('chipmunk')
    bld.read_shlib(
        'chipmunk',
        paths=[out_dir.make_node('lib')],
        export_includes=[out_dir.make_node('include')])
I find this VERY UGLY because:
The chipmunk library is needed ONLY during the final target's link phase; there's no reason to block the whole build (by using Build.POST_LAZY mode and bld.add_group()), yet unblocking it makes read_shlib() fail. Imagine if there was also some kind of git clone task before that...
Calling read_shlib() in the build() command implies that the caller knows how and where the tool installs the files. I'd like the tool itself to perform the call to read_shlib() (if necessary at all). But I failed to do this in run() and in runnable_status(); as suggested in paragraph 11.4.2 of the Waf Book section about custom tasks, it seems that I have to encapsulate the call to read_shlib() in ANOTHER task and put it inside the undocumented more_tasks attribute.
And here are the questions:
How can I encapsulate the read_shlib() call in a task, to be spawned by the CMake task?
Is it possible to let the tasks go in parallel in a non-blocking way for other tasks (suppose a project has 2 or 3 of these CMake dependencies, which are to be fetched by git from remote repos)?
Well, in fact you have already done most of the work :)
read_shlib only creates a fake task pretending to build an already existing lib. In your case, you really build the lib, so you really don't need read_shlib. You can just use your cmake task generator somewhere, given that you've set the right parameters.
The use keyword recognizes some parameters in the used task generators:
export_includes
export_defines
It also manages libs and task order if the used task generator has a link_task.
So you just have to set export_includes and export_defines correctly in your cmake task generator, plus set a link_task attribute which references your cmake_task attribute. You must also set your cmake_task outputs correctly for this to work, i.e. the first output of the list must be the lib node (what you do in add_target seems OK). Something like:
@feature('cmake')
@after_method('update_outputs')
def export_for_use(self):
    self.link_task = self.cmake_task
    out_dir = self.bld.bldnode.make_node(self.target)
    self.export_includes = out_dir.make_node('include')
This done, you will simply write in your main wscript:
def build(bld):
    for lists_file in bld.env.CMAKE_LISTS:
        if 'Chipmunk2D' in lists_file:
            bld(
                source=lists_file,
                features='cmake',
                target='chipmunk',
                lib_type='shared',
                cmake_args=[
                    '-DBUILD_DEMOS=OFF',
                    '-DINSTALL_DEMOS=OFF',
                    '-DBUILD_SHARED=ON',
                    '-DBUILD_STATIC=OFF',
                    '-DINSTALL_STATIC=OFF',
                    '-Wno-dev',
                ])
    bld.program(source="main.cpp", use="chipmunk")
You can of course simplify/factorize the code. I think add_target should not be in the task, since it mainly manages task generator attributes.

How do you define the host for Fabric to push to GitHub?

Original Question
I've got some Python scripts which have been using Amazon S3 to upload screenshots taken following Selenium tests within the script.
Now we're moving from S3 to GitHub, so I've found GitPython but can't see how you use it to actually commit to the local repo and push to the server.
My script builds a directory structure similar to \images\228M\View_Use_Case\1.png in the workspace, and when uploading to S3 it was a simple process:
for root, dirs, files in os.walk(imagesPath):
    for name in files:
        filename = os.path.join(root, name)
        k = bucket.new_key('{0}/{1}/{2}'.format(revisionNumber, images_process, name))  # returns a new key object
        k.set_contents_from_filename(filename, policy='public-read')  # opens local file buffers to key on S3
        k.set_metadata('Content-Type', 'image/png')
Is there something similar for this, or is there something as simple as a bash-style git add images command in GitPython that I've completely missed?
Updated with Fabric
So I've installed Fabric on kracekumar's recommendation but I can't find docs on how to define the (GitHub) hosts.
My script is pretty simple, just trying to get the upload to work:
from __future__ import with_statement
from fabric.api import *
from fabric.contrib.console import confirm
import os

def git_server():
    env.hosts = ['github.com']
    env.user = 'git'
    env.password = 'password'

def test():
    process = 'View Employee'
    os.chdir('\Work\BPTRTI\main\employer_toolkit')
    with cd('\Work\BPTRTI\main\employer_toolkit'):
        result = local('ant viewEmployee_git')
    if result.failed and not confirm("Tests failed. Continue anyway?"):
        abort("Aborting at user request.")

def deploy():
    process = "View Employee"
    os.chdir('\Documents and Settings\markw\GitTest')
    with cd('\Documents and Settings\markw\GitTest'):
        local('git add images')
        local('git commit -m "Latest Selenium screenshots for %s"' % (process))
        local('git push -u origin master')

def viewEmployee():
    #test()
    deploy()
It Works \o/ Hurrah.
You should look into Fabric: http://docs.fabfile.org/en/1.4.1/index.html. It is an automated server deployment tool. I have been using it for quite some time, and it works pretty well.
Here is one of my applications which uses it: https://github.com/kracekumar/sachintweets/blob/master/fabfile.py
It looks like you can do this:
index = repo.index
index.add(['images'])
new_commit = index.commit("my commit message")
and then, assuming you have origin as the default remote:
origin = repo.remotes.origin
origin.push()
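Putting those pieces together, a minimal end-to-end sketch (the repo path is the one from the deploy() function above; the commit message is a placeholder) could look like:
from git import Repo  # GitPython

repo = Repo(r'\Documents and Settings\markw\GitTest')   # existing local clone with `origin` configured
repo.index.add(['images'])                               # stage the screenshots directory
repo.index.commit('Latest Selenium screenshots')         # commit locally
repo.remotes.origin.push()                               # push to GitHub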

Controlling VirtualBox from commandline with python

We are using the Python VirtualBox API for controlling VirtualBox. For that we are using the "pyvb" package (as given in the Python API documentation).
al=pyvb.vb.VB()
m=pyvb.vm.vbVM()
al.startVM(m)
We have executed this using the Python interpreter. No error is shown, but VirtualBox doesn't start. Could you please tell us what could be wrong (all necessary modules and packages have been imported)?
I found that I can use the following functions to find if a VM is running, restore a VM to a specific snapshot, and start a VM by name.
from subprocess import Popen, PIPE

def running_vms():
    """
    Return list of running vms
    """
    f = Popen(r'vboxmanage --nologo list runningvms', stdout=PIPE).stdout
    data = [eachLine.strip() for eachLine in f]
    return data

def restore_vm(name='', snapshot=''):
    """
    Restore VM to specific snapshot uuid
    name = VM Name
    snapshot = uuid of snapshot (uuid can be found in the xml file of your machines folder)
    """
    command = r'vboxmanage --nologo snapshot %s restore %s' % (name, snapshot)
    f = Popen(command, stdout=PIPE).stdout
    data = [eachLine.strip() for eachLine in f]
    return data

def launch_vm(name=''):
    """
    Launch VM
    name = VM Name
    """
    command = r'vboxmanage --nologo startvm %s ' % name
    f = Popen(command, stdout=PIPE).stdout
    data = [eachLine.strip() for eachLine in f]
    return data
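A quick usage sketch of the helpers above (the VM name and snapshot UUID are placeholders; passing the command as a single string as done above assumes Windows, on other platforms split it into a list or pass shell=True):
# Hypothetical usage; "my-vm" and the UUID below are placeholders.
if __name__ == '__main__':
    print(running_vms())
    restore_vm(name='my-vm', snapshot='00000000-0000-0000-0000-000000000000')
    launch_vm(name='my-vm')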
The code quoted doesn't seem to specify what VM to run. Shouldn't you be doing a getVM call and then using that resulting VM instance in your startVM call? E.g.:
al=pyvb.vb.VB()
m=al.getVM(guid_of_vm)
al.startVM(m)
...would start the VM identified with the given GUID (all VirtualBox VMs have a GUID assigned when they're created). You can get the GUID from the VM's XML file. If you need to discover VMs at runtime, there's the handy listVMS call:
al=pyvb.vb.VB()
l=al.listVMS()
# choose a VM from the list, assign to 'm'
al.startVM(m)
