h2o and setting destination frame from python

h2o and setting destination frame from python - python

We use python to talk to single instance h2o (latest version 3.22.1.1).
Sometimes we get this error:
DistributedException from /10.192.21.17:54321: 'class water.fvec.Frame s3a://BUCKET_NAME/part-00001-0cd59acc-d03f-4af6-8227-e58bf7ad9562-c000.snappy.parquet is already in use. Unable to use it now. Consider using a different destination name.', caused by java.lang.IllegalArgumentException: class water.fvec.Frame s3a://BUCKET_NAME/part-00001-0cd59acc-d03f-4af6-8227-e58bf7ad9562-c000.snappy.parquet is already in use. Unable to use it now. Consider using a different destination name.
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:402)
We tried to pass our random destination_frame like this:
h2o.import_file(
path=data_path,
destination_frame='frame_{}'.format(str(uuid.uuid4())))
but it looks like destination_frame parameters is not used by H2O even though we see it present in the logs:
POST /3/Parse, parms: {number_columns=94, source_frames=["s3a://BUCKET_NAME/part-00000-0cd59acc-d03f-4af6-8227-e58bf7ad9562-c000.snappy.parquet"], column_types=["UUID","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Enum","Enum","Time","Numeric","Enum","Enum","Time","Time","Numeric","Enum","Enum","Numeric","Enum","Numeric","Numeric","Numeric","Enum","Enum","Enum","Enum","Enum","Numeric","Enum","Enum","Numeric","Enum","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Time","Numeric","Enum","Enum","Time","Numeric","Numeric","Enum","Enum","Enum","Enum","Enum","Numeric","Enum","Numeric","Enum","Numeric","Enum","Numeric","Enum","Numeric","Enum","Numeric","Numeric","Numeric","Numeric","UUID","Time","Numeric","Numeric","Enum","Numeric","Numeric","Numeric","Enum","Numeric","Numeric","Enum","Enum","Numeric","UUID","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Enum","Numeric","Numeric","Numeric"], single_quotes=True, parse_type=PARQUET, destination_frame=frame_19d32a0b-812f-4179-ba83-c3e1afe1d84f, column_names=[
"ALL_COLUMN_NAMES_HERE"], delete_on_done=True, check_header=1, separator=124, blocking=False, chunk_size=77450}

Related

Flyte 0.16.2: Error loading Blob - How to get Types.Blob.fetch() to work in task decorated function?

I have a Flyte task function like this:
#task
def do_stuff(framework_obj):
framework_obj.get_outputs() # This calls Types.Blob.fetch(some_uri)
Trying to load a blob URI using flytekit.sdk.types.Types.Blob.fetch, but getting this error:
ERROR:flytekit: Exception when executing No temporary file system is present. Either call this method from within the context of a task or surround with a 'with LocalTestFileSystem():' block. Or specify a path when calling this function. Note: Cleanup is not automatic when a path is specified.
I can confirm I can load blobs using with LocalTestFileSystem(), in tests, but when actually trying to run a workflow, I'm not sure why I'm getting this error, as the function that calls blob-processing is decorated with #task so it's definitely a Flyte Task. I also confirmed that the task node exists on the Flyte web console.
What path is the error referencing and how do I call this function appropriately?
Using Flyte Version 0.16.2

Could you please give a bit more information about the code? This is flytekit version 0.15.x? I'm a bit confused since that version shouldn't have the #task decorator. It should only have #python_task which is an older API. If you want to use the new python native typing API you should install flytekit==0.17.0 instead.
Also, could you point to the documentation you're looking at? We've updated the docs a fair amount recently, maybe there's some confusion around that. These are the examples worth looking at. There's also two new Python classes, FlyteFile and FlyteDirectory that have replaced the Blob class in flytekit (though that remains what the IDL type is called).
(would've left this as a comment but I don't have the reputation to yet.)
Some code to help with fetching outputs and reading from a file output
#task
def task_file_reader():
client = SynchronousFlyteClient("flyteadmin.flyte.svc.cluster.local:81", insecure=True)
exec_id = WorkflowExecutionIdentifier(
domain="development",
project="flytesnacks",
name="iaok0qy6k1",
)
data = client.get_execution_data(exec_id)
lit = data.full_outputs.literals["o0"]
ctx = FlyteContext.current_context()
ff = TypeEngine.to_python_value(ctx, lv=lit,
expected_python_type=FlyteFile)
with open(ff, 'rb') as fh:
print(fh.readlines())

RuntimeError passing a yaml profile to Essentia MusicExtractor

I am trying to generate a feature set with the Essentia MusicExtractor from a yaml profile as described in the documentation here and here via python.
My code snippet:
from essentia.standard import MusicExtractor
profile = "some_profile.yaml"
audio = "some_audio.mp3"
features, frames = MusicExtractor(profile=profile)(audio)
My yaml profile:
This produces the folling error:
RuntimeError:
Error while configuring MusicExtractor:
Pool: Cannot set/add/merge value to the pool under the name 'rhythm.stats'
because that name already exists but contains a different data type than value.
It does not really look that i am doing something wrong.

I ran into the same problem and fixed it this way:
Downloaded a sample profile from the essentia repos examples.
Ran the profile.
Commented out the conflicting lines after each run, which are just a few. Basically the stats and statsMFCC lines.
From this I could derive a working profile.

ipython-cypher in Python: cypher.run.Connection object parameter

I'm trying to use ipython-cypher to run Neo4j Cypher queries (and return a Pandas dataframe) in a Python program. I have no trouble forming a connection and running a query when using IPython Notebook, but when I try to run the same query outside of IPython, as per the documentation:
http://ipython-cypher.readthedocs.org/en/latest/introduction.html#usage-out-of-ipython
import cypher
results = cypher.run("MATCH (n)--(m) RETURN n.username, count(m) as neighbors",
"http://XXX.XXX.X.XXX:xxxx")
I get the following error:
neo4jrestclient.exceptions.StatusException: Code [401]: Unauthorized. No permission -- see authorization schemes.
Authorization Required
and
Format: (http|https)://username:password#hostname:port/db/name, or one of dict_keys([])
Now, I was just guessing that that was how I should enter a Connection object as the last parameter, because I couldn't find any additional documentation explaining how to connect to a remote host using Python, and in IPython, I am able to do:
%load_ext cypher
results = %cypher http://XXX.XXX.X.XXX:xxxx MATCH (n)--(m) RETURN n.username,
count(m) as neighbors
Any insight would be greatly appreciated. Thank you.

The documentation has a section for the API. When used outside of IPython and in need to connect to a different host, just using the parameter conn and passing a string should work.
import cypher
results = cypher.run("MATCH (n)--(m) RETURN n.username, count(m) as neighbors",
conn="http://XXX.XXX.X.XXX:xxxx")
But also consider that with the new authentication support in Neo4j 2.2, you need to set the new password before connecting from ipython-cypher. I will fix this as soon as I implement the forcing password change mechanism in neo4jrestclient, the library underneath.

Aerospike Python Client. UDF module to count records. Cannot register module

I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still on baby steps, currently Unit-Testing basic functionality, for example if I correctly add records in my Set. For that reason, I want to create a function to count them.
I saw in Aerospike's Documentation, that :
"to perform an aggregation on query, you first need to register a UDF
with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
return s : map(function() return 1 end) : reduce (function(a,b) return a+b end)
end
I'm trying to use aerospike python client's function to register a UDF(User Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
an exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or I'm fine with the Lua module?

First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the python process hangs. I can see the module appear on the server using AQL's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general using the stream-UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.

Loading a document on OpenOffice using an external Python program

I'm trying to create a python program (using pyUNO ) to make some changes on a OpenOffice calc sheet.
I've launched previously OpenOffice on "accept" mode to be able to connect from an external program. Apparently, should be as easy as:
import uno
# get the uno component context from the PyUNO runtime
localContext = uno.getComponentContext()
# create the UnoUrlResolver
resolver = localContext.ServiceManager.createInstanceWithContext(
"com.sun.star.bridge.UnoUrlResolver", localContext)
# connect to the running office
ctx = resolver.resolve("uno:socket,host=localhost,port=2002;"
"urp;StarOffice.ComponentContext")
smgr = ctx.ServiceManager
# get the central desktop object
DESKTOP =smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)
#The calling it's not exactly this way, just to simplify the code
DESKTOP.loadComponentFromURL('file.ods')
But I get an AttributeError when I try to access loadComponentFromURL. If I make a dir(DESKTOP), I've see only the following attributes/methods:
['ActiveFrame', 'DispatchRecorderSupplier', 'ImplementationId', 'ImplementationName',
'IsPlugged', 'PropertySetInfo', 'SupportedServiceNames', 'SuspendQuickstartVeto',
'Title', 'Types', 'addEventListener', 'addPropertyChangeListener',
'addVetoableChangeListener', 'dispose', 'disposing', 'getImplementationId',
'getImplementationName', 'getPropertySetInfo', 'getPropertyValue',
'getSupportedServiceNames', 'getTypes', 'handle', 'queryInterface',
'removeEventListener', 'removePropertyChangeListener', 'removeVetoableChangeListener',
'setPropertyValue', 'supportsService']
I've read that there are where a bug doing the same, but on OpenOffice 3.0 (I'm using OpenOffice 3.1 over Red Hat5.3). I've tried to use the workaround stated here, but they don't seems to be working.
Any ideas?

It has been a long time since I did anything with PyUNO, but looking at the code that worked last time I ran it back in '06, I did my load document like this:
def urlify(path):
return uno.systemPathToFileUrl(os.path.realpath(path))
desktop.loadComponentFromURL(
urlify(tempfilename), "_blank", 0, ())
Your example is a simplified version, and I'm not sure if you've removed the extra arguments intentionally or not intentionally.
If loadComponentFromURL isn't there, then the API has changed or there's something else wrong, I've read through your code and it looks like you're doing all the same things I have.
I don't believe that the dir() of the methods on the desktop object will be useful, as I think there's a __getattr__ method being used to proxy through the requests, and all the methods you've printed out are utility methods used for the stand-in object for the com.sun.star.frame.Desktop.
I think perhaps the failure could be that there's no method named loadComponentFromURL that has exactly 1 argument. Perhaps giving the 4 argument version will result in the method being found and used. This could simply be an impedance mismatch between Python and Java, where Java has call-signature method overloading.

This looks like issue 90701: http://www.openoffice.org/issues/show_bug.cgi?id=90701
See also http://piiis.blogspot.com/2008/10/pyuno-broken-in-ooo-30-with-system.html and http://udk.openoffice.org/python/python-bridge.html

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

h2o and setting destination frame from python - python

Related

Flyte 0.16.2: Error loading Blob - How to get Types.Blob.fetch() to work in task decorated function?

RuntimeError passing a yaml profile to Essentia MusicExtractor

ipython-cypher in Python: cypher.run.Connection object parameter

Aerospike Python Client. UDF module to count records. Cannot register module

Loading a document on OpenOffice using an external Python program

Categories

Resources