Calculate hash of an h2o frame - python

I would like to calculate a hash value of an h2o.frame.H2OFrame, ideally in both R and Python. My understanding of h2o.frame.H2OFrame is that these objects basically "live" on the h2o server (i.e., they are represented by Java objects) and not within the R or Python session from which they were uploaded.
I want to calculate the hash value "as close as possible" to the actual training algorithm. That rules out calculation of the hash value on (serializations of) the underlying R or python objects, as well as on any underlying files from where the data was loaded.
The reason for this is that I want to capture all (possible) changes that h2o's upload functions perform on the underlying data.
Judging from the h2o docs, there is no hash-like functionality exposed through h2o.frame.H2OFrame.
One possibility to achieve a hash-like summary of the h2o data is through summing over all numerical columns and doing something similar for categorical columns. However, I would really like to have some avalanche effect in my hash function so that small changes in the function input result in large differences of the output. This requirement rules out simple sums and the like.
Is there already some interface which I might have overlooked?
If not, how could I achieve the task described above?
import h2o
h2o.init()
iris_df = h2o.upload_file(path="~/iris.csv")
# what I would like to achieve
iris_df.hash()
# >>> ab2132nfqf3rf37
# ab2132nfqf3rf37 is the (made up) hash value of iris_df
Thank you for your help.

It is available in the REST API (see screenshot). You can probably get to it from the H2OFrame object in Python as well, but it is not directly exposed.
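For example, if your version of the h2o Python client ships the generic h2o.api() helper for raw REST calls, a sketch along these lines may already be enough (the response layout is assumed to match the REST output used in the solution below):
import h2o

h2o.init()
iris_df = h2o.upload_file(path="~/iris.csv")

# h2o.api() issues a raw REST call through the client and returns the parsed
# JSON response; the "frames"/"checksum" keys mirror the REST output shown below.
res = h2o.api("GET /3/Frames/%s" % iris_df.frame_id)
print("Checksum:", res["frames"][0]["checksum"])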

So here is a complete solution in Python, based on Michal Kurka's and Tom Kraljevic's suggestions:
import h2o
import requests
import json
h2o.init()
iris_df = h2o.upload_file(path="~/iris.csv")
apiEndpoint = "http://127.0.0.1:54321/3/Frames/"
res = json.loads(requests.get(apiEndpoint + iris_df.frame_id).text)
print("Checksum 1: ", res["frames"][0]["checksum"])
# change a single value slightly so the checksum changes
iris_df[0, 1] = iris_df[0, 1] + 1e-3
res = json.loads(requests.get(apiEndpoint + iris_df.frame_id).text)
print("Checksum 2: ", res["frames"][0]["checksum"])
h2o.cluster().shutdown()
This gives
Checksum 1: 8858396055714143663
Checksum 2: -4953793257165767052
Thanks for your help!

Related

Post-filtering in Weaviate

I have a Weaviate instance running (version 1.12.2). I am playing around with the Python client https://weaviate-python-client.readthedocs.io/en/stable/ (version 3.4.2): adding, retrieving, deleting objects, etc.
I am trying to understand how filtered vector search works (outlined here https://weaviate.io/developers/weaviate/current/architecture/prefiltering.html#recall-on-pre-filtered-searches)
When applying pre-filtering, an 'allow-list' of object ids is constructed before carrying out vector search. This is done by using some property to filter out objects.
For example the Where filter I'm using is:
where_filter_1 = {
    "path": ["user"],
    "operator": "Equal",
    "valueText": "billy"
}
This is because I've got many users whose data are kept in this DB and I would like for each user to be able to search their own data. In this case it is Image data.
This is how I implement this using the python client:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_where(where_filter_1)\
    .with_near_vector(nearVector)\
    .do()
I do not use any vectorization modules, so I create my own vector and pass it to the DB for vector search using .with_near_vector(nearVector) after applying the filter with .with_where(where_filter_1). This works as I expect, so I think I'm doing it correctly.
I'm less sure if I'm applying post-filtering correctly:
Each image has some text attached to it. I use the Where filter to search through the text by using the inverted index structure.
where_filter_2 = {
    "path": ["image_text"],
    "operator": "Like",
    "valueText": "Paris France"
}
I apply post filtering like this:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_near_vector(nearVector)\
    .with_where(where_filter_2).do()
However, I don't think I'm doing this properly.
A basic inverted index search: (so just searching with text)
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_where(where_filter_2).do()
Measured with the tqdm module, this gives me about 5 iterations/sec with 38k objects in the DB.
The post-filtering approach gives me the same performance, at 5 iterations/sec.
Am I wrong to find this weird? I was expecting performance closer to pure vector search:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_near_vector(nearVector).do()
This is close to 60 iterations/sec (the flat-search cut-off is set to 60k, so only brute-force search is used here).
Is the 'Where' filter applied only on the results supplied by the vector search? If so, shouldn't it be much faster? The filter would only be applied to 100 objects at most since that is the default number of results of vector search.
This is kind of confusing. Am I wrong in my understanding of how search works?
Thanks for reading my question !
Your question seems to imply that you are switching between a pre- and a post-filtering approach, but as of v1.13 all filtered vector searches use pre-filtering; there is currently no option for post-filtering. That explains why both of your searches perform identically. You are mostly experiencing the cost of building the filter.
Side-Note 1:
I see that you are using a Like operator. The Like operator only differs from the Equal operator if you are using wildcards. Since you are not using any, you can switch to the Equal operator, which tends to be more efficient in many cases. (I'm not sure whether that applies to your case, but it tends to be true overall.)
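For illustration, the same filter from the question rewritten with Equal, as suggested above (only the operator changes; this is a sketch, not a benchmark):
# Equivalent to where_filter_2 as long as no wildcards (*, ?) are involved.
where_filter_2_equal = {
    "path": ["image_text"],
    "operator": "Equal",
    "valueText": "Paris France"
}

result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_near_vector(nearVector)\
    .with_where(where_filter_2_equal)\
    .do()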
Side-Note 2:
If you are measuring throughput from a single client thread, e.g. with tqdm in a Python script without multi-threading, you're not maxing out Weaviate: since you only send the second query once the first has been processed client-side, Weaviate will be idle most of the time. If you are interested in the maximum throughput, make sure you have at least as many client threads as there are cores on the server.
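A rough sketch of what such a multi-threaded measurement could look like, assuming client, nearVector, and where_filter_2 are set up as in the question and that the client object can safely be shared across threads (otherwise create one client per thread):
from concurrent.futures import ThreadPoolExecutor
import time

def run_query(_):
    # One filtered vector search, same as in the question.
    return client.query.get("Image", ["image_uri", "_additional {certainty}"])\
        .with_near_vector(nearVector)\
        .with_where(where_filter_2)\
        .do()

n_queries = 500
n_threads = 8  # ideally >= number of cores on the Weaviate server

start = time.time()
with ThreadPoolExecutor(max_workers=n_threads) as pool:
    results = list(pool.map(run_query, range(n_queries)))
elapsed = time.time() - start
print("queries/sec:", n_queries / elapsed)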

How to use PyFFTW's wisdom

I didn't see an actual example in pyfftw's documentation of how to use the 'wisdom' feature, so I'm a little confused.
My code looks something like the following:
import pyfftw

# first FFT
input = pyfftw.zeros_aligned(arraySize, dtype='complex64')
input[:] = image
fftwObj = pyfftw.builders.fft2(input, planner_effort='FFTW_EXHAUSTIVE')
imageFFT = fftwObj(input)

wisdom = pyfftw.export_wisdom()
pyfftw.import_wisdom(wisdom)

# second FFT with the same input size but different input
input = pyfftw.zeros_aligned(arraySize, dtype='complex64')
input[:] = image2
fftwObj = pyfftw.builders.fft2(input, planner_effort='FFTW_EXHAUSTIVE')
imageFFT2 = fftwObj(input)
The docs say that export_wisdom outputs a tuple of strings and that import_wisdom takes in this tuple as an argument.
When am I supposed to export the wisdom and am I supposed to save this tuple out to a file for each FFT?
When do I load it back in? Before the call to each FFT?
Basically, exporting and importing wisdom is a method to maintain state between sessions.
The wisdom is the knowledge about how best to plan an FFT. During a session, the internal wisdom is made up of all the plans that have been made so far plus any wisdom that has been imported. Repeatedly importing the same wisdom file is not useful, because that knowledge is already known after the first import.
You export wisdom when you want the knowledge about a particular transform plan to be reused instead of having to be worked out again. A given transform only needs to be planned once per session, though.
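A minimal sketch of persisting wisdom across sessions with pickle (the file path is hypothetical; export_wisdom() returns a tuple of byte strings, which pickles cleanly):
import os
import pickle

import pyfftw

WISDOM_FILE = "fftw_wisdom.pickle"  # hypothetical path

# At the start of a session: load previously accumulated wisdom, if any.
if os.path.exists(WISDOM_FILE):
    with open(WISDOM_FILE, "rb") as f:
        pyfftw.import_wisdom(pickle.load(f))

# ... build and execute your FFTW objects as usual; planning a transform
# already covered by the imported wisdom should then be fast ...

# At the end of the session: save everything learned so far.
with open(WISDOM_FILE, "wb") as f:
    pickle.dump(pyfftw.export_wisdom(), f)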

Write a collection to a binary file

I'm looking for a way to write some data (List, Array, etc.) into a binary file. The collection to put into the binary file represents a list of points. What I have tried so far:
Welcome to Scala 2.12.0-M3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40).
Type in expressions for evaluation. Or try :help.
scala> import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
scala> val oos = new ObjectOutputStream(new FileOutputStream("/tmp/f1.data"))
oos: java.io.ObjectOutputStream = java.io.ObjectOutputStream@13bc8645
scala> oos.writeObject(List(1,2,3,4))
scala> oos.close
scala> val ois = new ObjectInputStream(new FileInputStream("/tmp/f1.data"))
ois: java.io.ObjectInputStream = java.io.ObjectInputStream@392a04e7
scala> val s : List[Int] = ois.readObject().asInstanceOf[List[Int]]
s: List[Int] = List(1, 2, 3, 4)
OK, it's working well. The problem is that maybe tomorrow I will need to read this binary file from another language such as Python. Is there a way to produce a more generic binary file that can be read by multiple languages?
Solution
For anyone in the same situation, you can do it like this:
import java.io.RandomAccessFile
import java.nio.ByteBuffer

def write2binFile(filename: String, a: Array[Int]) = {
  val channel = new RandomAccessFile(filename, "rw").getChannel
  val bbuffer = ByteBuffer.allocateDirect(a.length * 4)
  val ibuffer = bbuffer.asIntBuffer()
  ibuffer.put(a)
  channel.write(bbuffer)
  channel.close
}
Format for cross-platform sharing of point coordinates allowing selective access by RANGE
Your requirements are:
store the data from Scala, read it from Python (or other languages)
the data are lists of point coordinates
store the data on AWS S3
allow fetching only part of the data using RANGE requests
Data structure to use
The data must be uniform in structure and size per element, so that the position of a given part can be calculated for a RANGE request.
If the Scala format for storing lists/arrays fulfils this requirement and the binary format is well defined, you may succeed. If not, you will have to find another format.
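For instance, if every point were stored as a fixed-size record of two 4-byte integers, the byte range of the i-th point follows directly from its index. A hedged sketch using plain HTTP range requests against a hypothetical S3 URL:
import struct

import requests

RECORD_SIZE = 8  # two 4-byte big-endian ints per point (assumption)
URL = "https://my-bucket.s3.amazonaws.com/points.bin"  # hypothetical object

def fetch_point(index):
    start = index * RECORD_SIZE
    end = start + RECORD_SIZE - 1
    resp = requests.get(URL, headers={"Range": "bytes=%d-%d" % (start, end)})
    resp.raise_for_status()
    # ">ii" = big-endian ints, matching java.nio.ByteBuffer's default byte order
    return struct.unpack(">ii", resp.content)

print(fetch_point(42))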
Reading binary data in Python
Assuming the format is known, use Python's struct module from the standard library to read it.
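For the file written by write2binFile above, a reader sketch might look like this (assuming Java's default big-endian byte order; the file name is hypothetical):
import struct

def read_bin_file(filename):
    """Read a file of raw 4-byte big-endian ints, as written by write2binFile."""
    with open(filename, "rb") as f:
        data = f.read()
    count = len(data) // 4
    return list(struct.unpack(">%di" % count, data))

print(read_bin_file("/tmp/points.bin"))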
Alternative approach: split data to smaller pieces
You want to access the data piece by piece, probably expecting one large object on S3 and using HTTP requests with RANGE.
An alternative is to split the data into smaller pieces of a size that is reasonable for fetching (e.g. 64 kB, but you know your use case better) and to design a rule for storing them piece by piece on AWS S3. You may even use a tree structure for this purpose.
There are some advantages to this approach:
use whatever format you like, e.g. XML or JSON; no need to deal with special binary formats
pieces can be compressed, so you will save some costs
Note that AWS S3 charges not only for data transfer but also per request, so each HTTP request using RANGE is counted as one.
Cross-platform binary formats to consider
Consider the following formats:
BSON
(Google) Protocol Buffers
HDF5
If you visit Wikipedia page for any of those formats, you will find many links to other formats.
Anyway, I am not aware of any such format that uses a uniform size per element, as most of them try to keep the size as small as possible. For this reason they cannot be used in a RANGE-based scenario unless some special index file is introduced (which is probably not very feasible).
On the other hand, using these formats with the alternative approach (splitting the data into smaller pieces) should work.
Note: I did some tests in the past regarding storage efficiency and encoding/decoding speed. From a practical point of view, the best results were achieved with a simple JSON structure (possibly compressed). JSON is available on every platform, it is very simple to use, and encoding/decoding is fast (though not necessarily the fastest).
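To illustrate that last point, writing and reading one compressed JSON piece takes only a few lines (the file name is hypothetical):
import gzip
import json

points = [(1.5, 2.25), (3.0, 4.75), (5.5, 6.0)]  # sample point coordinates

# Write one "piece" as gzipped JSON ...
with gzip.open("points-piece-0001.json.gz", "wt", encoding="utf-8") as f:
    json.dump(points, f)

# ... and read it back (tuples come back as lists).
with gzip.open("points-piece-0001.json.gz", "rt", encoding="utf-8") as f:
    restored = json.load(f)
print(restored)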

very slow function with two for loops using Arcpy in python

I wrote some code which works perfectly on small datasets, but when I run it over a dataset with 52,000 features it seems to get stuck in the function below:
import time

def extract_neighboring_OSM_nodes(ref_nodes, cor_nodes):
    time_start = time.time()
    print "here we start finding neighbors at ", time_start
    for ref_node in ref_nodes:
        buffered_node = ref_node[2].buffer(10)
        for cor_node in cor_nodes:
            if cor_node[2].within(buffered_node):
                ref_node[4].append(cor_node[0])
                cor_node[4].append(ref_node[0])
        # node[4][:] = [cor_nodes.index(x) for x in cor_nodes if x[2].within(buffered_node)]
    time_end = time.time()
    print "neighbor extraction took ", time_end - time_start
    return ref_nodes
ref_nodes and cor_nodes are lists of tuples of the form:
[(FID, point, geometry, links, neighbors)]
neighbors is an empty list which is going to be populated in the above function.
As I said, the last message printed out is the first print command in this function. It seems that this function is very slow, but for 52,000 features it should not take 24 hours, should it?
Any Idea where the problem would be or how to make the function faster?
You can try multiprocessing, here is an example - http://pythongisandstuff.wordpress.com/2013/07/31/using-arcpy-with-multiprocessing-%E2%80%93-part-3/.
If you want to get the K nearest neighbors of every (or some, it doesn't matter) sample in a dataset, or the eps-neighborhood of samples, there is no need to implement it yourself; there are libraries built specifically for this purpose.
Once they have built the data structure (usually some kind of tree) you can query it for the neighborhood of a given sample. For high-dimensional data these structures are usually not as effective as they are in low dimensions, but there are solutions for high-dimensional data as well.
One I can recommend here is KDTree which has a Scipy implementation.
I hope you find it useful as I did.
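As a hedged sketch of how that could look for the function in the question (the coordinate extraction via firstPoint is an assumption; adapt it to however your geometries expose X/Y):
import numpy as np
from scipy.spatial import cKDTree

def extract_neighboring_OSM_nodes_kdtree(ref_nodes, cor_nodes, radius=10.0):
    # Coordinate extraction is an assumption: replace with however your
    # geometries expose X/Y (e.g. geom.firstPoint.X / .Y for arcpy point geometries).
    ref_xy = np.array([(n[2].firstPoint.X, n[2].firstPoint.Y) for n in ref_nodes])
    cor_xy = np.array([(n[2].firstPoint.X, n[2].firstPoint.Y) for n in cor_nodes])

    tree = cKDTree(cor_xy)
    # For each reference node, find all cor_nodes within `radius` in one call.
    for ref_idx, cor_indices in enumerate(tree.query_ball_point(ref_xy, r=radius)):
        for cor_idx in cor_indices:
            ref_nodes[ref_idx][4].append(cor_nodes[cor_idx][0])
            cor_nodes[cor_idx][4].append(ref_nodes[ref_idx][0])
    return ref_nodes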

Python sas7bdat module usage

I have to dump data from SAS datasets. I found a Python module called sas7bdat.py that says it can read SAS .sas7bdat datasets, and I think it would be simpler and more straightforward to do the project in Python rather than SAS due to the other functionality required. However, help(sas7bdat) in interactive Python is not very useful, and the only example I was able to find for dumping a dataset is as follows:
import sas7bdat
from sas7bdat import *
# following line is sas dataset to convert
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
#following line is txt file to create
foo.convertFile('/support/textfiles/locked_data.txt','\t')
This doesn't do what I want because a) it uses the SAS variable names as column headers and I need it to use the variable labels, and b) it uses "nan" to denote missing numeric values where I'd rather just leave the value blank.
Can anyone point me to some useful documentation on the methods included in sas7bdat.py? I've Googled every permutation of key words that I could think of, with no luck. If not, can someone give me an example or two of using readColumnAttributes(), readColumnLabels(), and/or readColumnNames()?
Thanks, all.
As time passes, solutions become easier. I think this one is easiest if you want to work with pandas:
import pandas as pd
df = pd.read_sas('/support/sas/locked_data.sas7bdat')
Note that it is easy to get a numpy array by using df.values
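If the end goal is still a tab-delimited dump with blanks instead of NaN (requirement (b) in the question), pandas can write that directly; a small sketch:
import pandas as pd

df = pd.read_sas('/support/sas/locked_data.sas7bdat')

# na_rep='' writes missing values as empty fields rather than "nan"/"NaN".
df.to_csv('/support/textfiles/locked_data.txt', sep='\t', na_rep='', index=False)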
This is only a partial answer as I've found no [easy to read] concrete documentation.
You can view the source code here
This shows some basic info regarding what arguments the methods require, such as:
readColumnAttributes(self, colattr)
readColumnLabels(self, collabs, coltext, colcount)
readColumnNames(self, colname, coltext)
I think most of what you are after is stored in the "header" class returned when creating an object with SAS7BDAT. If you just print that class you'll get a lot of info, but you can also access class attributes as well. I think most of what you may be looking for would be under foo.header.cols. I suspect you use various header attributes as parameters for the methods you mention.
Maybe something like this will get you closer?
from sas7bdat import SAS7BDAT

foo = SAS7BDAT(inFile)  # your file here...
for i in foo.header.cols:
    print '"Attributes"', i.attr
    print '"Labels"', i.label
    print '"Name"', i.name
edit: Unrelated to this specific question, but the type() and dir() commands come in handy when trying to figure out what is going on in an unfamiliar class/library
I know I'm late with this answer, but in case someone searches for a similar question, the best option is:
from sas7bdat import SAS7BDAT

foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
# This converts the dataset to a pandas DataFrame:
ds = foo.to_data_frame()
Personally I think the better approach would be to export the data using SAS then process the external file as needed using Python.
In SAS, you can do this...
libname datalib "/support/sas";
filename sasdump "/support/textfiles/locked_data.txt";

proc export
    data = datalib.locked_data
    outfile = sasdump
    dbms = tab
    label
    replace;
run;
The downside to this is that while the column labels are used rather than the variable names, the labels are enclosed in double quotes. When processing in Python, you may need to programmatically remove them if they cause a problem. I hope that helps even though it doesn't use Python like you wanted.
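If the quoted labels do get in the way on the Python side, the csv module strips them while reading the tab-delimited export; a brief sketch (path taken from the example above):
import csv

with open('/support/textfiles/locked_data.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')  # quotechar defaults to '"'
    header = next(reader)  # labels come back without the surrounding quotes
    rows = list(reader)

print(header)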
