Post-filtering in Weaviate - python

I have a Weaviate instance running (ver 1.12.2). I am playing around with the Python client https://weaviate-python-client.readthedocs.io/en/stable/ (ver 3.4.2) (adding, retrieving, deleting objects, etc.).
I am trying to understand how filtered vector search works (outlined here https://weaviate.io/developers/weaviate/current/architecture/prefiltering.html#recall-on-pre-filtered-searches)
When pre-filtering is applied, an 'allow list' of object IDs is constructed before the vector search is carried out, using some property to filter the objects.
For example the Where filter I'm using is:
where_filter_1 = {
    "path": ["user"],
    "operator": "Equal",
    "valueText": "billy"
}
This is because I've got many users whose data are kept in this DB and I would like for each user to be able to search their own data. In this case it is Image data.
This is how I implement this using the python client:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"]) \
    .with_where(where_filter_1) \
    .with_near_vector(nearVector) \
    .do()
I do not use any vectorization modules, so I create my own vector and pass it to the DB for vector search using .with_near_vector(nearVector) after applying the filter with .with_where(where_filter_1). This works as I expect, so I think I'm doing it correctly.
I'm less sure if I'm applying post-filtering correctly:
Each image has some text attached to it. I use the Where filter to search through the text by using the inverted index structure.
where_filter_2 = {
    "path": ["image_text"],
    "operator": "Like",
    "valueText": "Paris France"
}
I apply post filtering like this:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"]) \
    .with_near_vector(nearVector) \
    .with_where(where_filter_2) \
    .do()
However, I don't think I'm doing this properly.
A basic inverted-index search (so just searching with text):
result = client.query.get("Image", ["image_uri", "_additional {certainty}"]) \
    .with_where(where_filter_2) \
    .do()
Measured with the tqdm module, this gives me about 5 iters/sec with 38k objects in the DB, while the post-filtering approach gives me the same performance, about 5 iters/sec.
Am I wrong to find this weird? I was expecting performance closer to pure vector search:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"]) \
    .with_near_vector(nearVector) \
    .do()
This runs at close to 60 iters/sec (the flat-search cut-off is set to 60k, so only brute-force search is used here).
Is the 'Where' filter applied only to the results returned by the vector search? If so, shouldn't it be much faster? The filter would only be applied to 100 objects at most, since that is the default number of results of a vector search.
This is kind of confusing. Am I wrong in my understanding of how search works?
Thanks for reading my question!

Your question seems to imply that you are switching between a pre- and a post-filtering approach. But as of v1.13, all filtered vector searches use pre-filtering; there is currently no option for post-filtering. That explains why both of your searches show identical performance: you are mostly paying the cost of building the filter.
Side-Note 1:
I see that you are using the Like operator. Like only differs from Equal when wildcards are involved; since you are not using any, you can switch to the Equal operator, which tends to be more efficient (I'm not sure how much that matters in your case, but it tends to be true overall).
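For example, a minimal sketch reusing the image_text filter from the question; the wildcard value is just an illustration:
# Equal matches the exact stored value
where_equal = {
    "path": ["image_text"],
    "operator": "Equal",
    "valueText": "Paris France"
}

# Like only behaves differently once wildcards are involved, e.g. a substring match
where_like_wildcard = {
    "path": ["image_text"],
    "operator": "Like",
    "valueText": "*Paris*"
}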
Side-Note 2:
If you are measuring throughput from a single client thread, i.e. using tqdm in a Python script without multi-threading, you are not maxing out Weaviate. Since you only send the second query once the first has been processed client-side, Weaviate will be idle most of the time. If you are interested in maximum throughput, make sure you have at least as many client threads as there are cores on the server.
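A rough sketch (not your exact benchmark) of driving Weaviate from several client threads with the standard library, reusing the client, nearVector and where_filter_2 from your question; the thread and query counts are arbitrary:
from concurrent.futures import ThreadPoolExecutor
import time

N_THREADS = 8      # ideally at least as many as the server has cores
N_QUERIES = 1000   # arbitrary benchmark size

def run_query(_):
    return client.query.get("Image", ["image_uri", "_additional {certainty}"]) \
        .with_near_vector(nearVector) \
        .with_where(where_filter_2) \
        .do()

start = time.time()
with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    list(pool.map(run_query, range(N_QUERIES)))
elapsed = time.time() - start

print(f"{N_QUERIES / elapsed:.1f} queries/sec")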

Related

How do I efficiently calculate the mean of nested subsets of Vaex dataframes?

I have a very large dataset comprising data for several dozen samples, with several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share as it is part of an active research project; I hope to open-source it, but that will depend on the IP rules in the agreement), and then I'll share some code that replicates the problem and should hopefully allow somebody a bit more well-versed in Vaex to tell me what I'm doing wrong!
My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list of unique samples. On each loop it uses the sample number to make an expression representing that sample (so: df[df["sample"] == i]) and uses unique() on that subset to get a list of subsamples. Then it uses another for-loop to repeat that process, creating an expression for the subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:
means = {}
list_of_samples = df["sample"].unique()
for sample_number in list_of_samples:
    sample = df[df["sample"] == sample_number]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}
    for subsample_number in list_of_subsamples:
        subsample = sample[sample["subsample"] == subsample_number]
        means[sample_number][subsample_number] = subsample["value"].mean()
If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try and diagnose the issue, I have tested the mean function by itself, and in expressions without the looping and other stuff. If I run:
mean = df["value"].mean()
it successfully gives me the mean for the entire "value" column within about 45 seconds. However, if instead I run:
sample = df[df["sample"] == 1]
subsample = sample[sample["subsample"] == 1]
mean = subsample["value"].mean()
the program just hangs. I've left it for an hour and still not gotten a result!
How can I fix this, and what am I doing wrong so I can avoid this mistake in the future? If my reading of some discussions regarding Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here. Any help from a more experienced Vaex user would be greatly appreciated!
Edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious what was going wrong here, but I'll have to wait until I have more time to investigate it.
Looping can be slow, especially if you have many groups; it's more efficient to rely on the built-in grouping:
import vaex
df = vaex.example()
df.groupby(by='id', agg="mean")
# for more complex cases, can use by=['sample', 'sub_sample']
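Applied to the question's columns, a hedged sketch (assuming 'sample', 'subsample' and 'value' as in the original loop, and a hypothetical data file):
import vaex

df = vaex.open("data.hdf5")  # hypothetical file; use your own DataFrame

# one pass over the data instead of one pass per subsample
means_df = df.groupby(by=["sample", "subsample"],
                      agg={"mean_value": vaex.agg.mean("value")})
print(means_df)  # one row per (sample, subsample) pair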

Calculate hash of an h2o frame

I would like to calculate some hash value of an h2o.frame.H2OFrame, ideally in both R and Python. My understanding of h2o.frame.H2OFrame is that these objects basically "live" on the H2O server (i.e., are represented by some Java objects) and not within R or Python, from where they might have been uploaded.
I want to calculate the hash value "as close as possible" to the actual training algorithm. That rules out calculation of the hash value on (serializations of) the underlying R or python objects, as well as on any underlying files from where the data was loaded.
The reason for this is that I want to capture all (possible) changes that h2o's upload functions perform on the underlying data.
Inferring from the h2o docs, there is no hash-like functionality exposed through h2o.frame.H2OFrame.
One possibility to achieve a hash-like summary of the h2o data is through summing over all numerical columns and doing something similar for categorical columns. However, I would really like to have some avalanche effect in my hash function so that small changes in the function input result in large differences of the output. This requirement rules out simple sums and the like.
Is there already some interface which I might have overlooked?
If not, how could I achieve the task described above?
import h2o
h2o.init()
iris_df=h2o.upload_file(path="~/iris.csv")
# what I would like to achieve
iris_df.hash()
# >>> ab2132nfqf3rf37
# ab2132nfqf3rf37 is the (made up) hash value of iris_df
Thank you for your help.
It is available in the REST API; you can probably get to it from the H2OFrame object in Python as well, but it is not directly exposed.
So here is a complete solution in Python, based on Michal Kurka's and Tom Kraljevic's suggestions:
import h2o
import requests
import json
h2o.init()
iris_df = h2o.upload_file(path="~/iris.csv")
apiEndpoint = "http://127.0.0.1:54321/3/Frames/"
res = json.loads(requests.get(apiEndpoint + iris_df.frame_id).text)
print("Checksum 1: ", res["frames"][0]["checksum"])
#change a bit
iris_df[0, 1] = iris_df[0, 1] + 1e-3
res = json.loads(requests.get(apiEndpoint + iris_df.frame_id).text)
print("Checksum 2: ", res["frames"][0]["checksum"])
h2o.cluster().shutdown()
This gives
Checksum 1: 8858396055714143663
Checksum 2: -4953793257165767052
Thanks for your help!
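If you need the checksum in several places, a small helper based on the code above keeps the REST call in one spot (the hardcoded local endpoint is the same assumption as in the snippet):
import json
import requests

def frame_checksum(frame, api_endpoint="http://127.0.0.1:54321/3/Frames/"):
    """Return the server-side checksum of an H2OFrame via the Frames REST endpoint."""
    res = json.loads(requests.get(api_endpoint + frame.frame_id).text)
    return res["frames"][0]["checksum"]

# usage: frame_checksum(iris_df)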

ravendb python api, query always returns 128

I'm querying my RavenDB instance. My target collection contains more than 30k documents. I'm using pyravendb with Python 3.
I'm querying my index using the following code:
result_ = self.store.database_commands.query(
    index_name="Raven/DocumentsByEntityName",
    index_query=IndexQuery("Tag:MyCollection", total_size=128, skipped_results=start))
if len(result_['Results']) < 128:
    return
start being the offset variable that increments by 128 each time I query.
When I run this code, the result's length is always 128, which leads to an infinite loop.
Any ideas why it acts like this?
The problem was the parameter I was using. The proper parameter is start=offset_that_you_want_to_skip, not skipped_results=offset.
The correct code is the following:
result_ = self.store.database_commands.query(
    index_name="Raven/DocumentsByEntityName",
    index_query=IndexQuery("Tag:MyCollection", total_size=128, skipped_results=0,
                           default_operator=None, start=offset))
# blablabla
offset += 128
if len(result_['Results']) < 128:
    return
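Putting the corrected start parameter into a full paging loop might look like the sketch below; process_batch is a hypothetical placeholder for whatever you do with each page:
offset = 0
while True:
    result_ = self.store.database_commands.query(
        index_name="Raven/DocumentsByEntityName",
        index_query=IndexQuery("Tag:MyCollection", total_size=128, skipped_results=0,
                               default_operator=None, start=offset))
    process_batch(result_['Results'])  # hypothetical handler for each page
    if len(result_['Results']) < 128:
        break
    offset += 128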
Take a look here at my commit:
Get all of a collection's document IDs in RavenDB for a "per-document" modification
In pyravendb v3.5.3.5 I updated IndexQuery, and now you are able to skip or to take fewer or more than 128 documents.
One other thing: don't use total_size or skipped_results (they are going to be removed).
I know this is not exactly going to answer your question, but did you consider using RavenDB's streaming functionality? https://ravendb.net/docs/article-page/3.5/csharp/client-api/session/querying/how-to-stream-query-results
In many cases, when dealing with a large number of documents, this can be faster and simpler than iterating with Query().
However, please be aware that streamed objects will not be tracked, meaning that changes to these objects and a subsequent SaveChanges() call won't have any effect on the documents stored within RavenDB.

write table cell real-time python

I would like to loop through a database, find the appropriate values and insert them into the appropriate cell in a separate file. It may be a CSV or any other human-readable format.
In pseudo-code:
for item in huge_db:
    for object_to_match in list_of_objects_to_match:
        if itemmatch():
            if there_arent_three_matches_yet_in_list():
                matches += 1
                result = performoperationonitem()
                write_in_file(result, row=object_to_match_id, col=matches)
                if matches == 3:
                    remove_this_object_from_object_to_match_list()
Can you think of any way other than going through the whole output file line by line every time? I don't even know what to search for...
Even better, are there better ways to find three matching objects in a DB and have the results in real time? (The operation will take a while, but I'd like to see the results popping up in real time.)
Assuming itemmatch() is a reasonably simple function, this will do what I think you want better than your pseudocode:
for match_obj in list_of_objects_to_match:
    db_objects = query_db_for_matches(match_obj)
    if len(db_objects) >= 3:
        result = performoperationonitem()
        write_in_file(result, row=match_obj.id, col=matches)
    else:
        write_blank_line(row=match_obj.id)  # if you want
Then the trick becomes writing the query_db_for_matches() function. Without more detail, I'll assume you're looking for objects that match in one particular field; call it type. In pymongo such a query would look like:
def query_db_for_matches(match_obj):
    return pymongo_collection.find({"type": match_obj.type})
To get this to run efficiently, make sure your database has an index on the field(s) you're querying on by first calling:
pymongo_collection.ensure_index([("type", 1)])
The first time you call ensure_index it could take a long time for a huge collection. But each time after that it will be fast -- fast enough that you could even put it into query_db_for_matches before your find and it would be fine.
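Since only three matches per object are ever used, one further refinement (not part of the original answer, just a sketch) is to cap the cursor so MongoDB can stop scanning as soon as it has enough:
def query_db_for_matches(match_obj, max_matches=3):
    # limit() returns at most max_matches documents, so the scan can stop early
    return list(pymongo_collection.find({"type": match_obj.type}).limit(max_matches))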

What is the best way to store set data in Python?

I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update: based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation done before I insert it into the database. Unfortunately, I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [('description1a', 'type1'), ('description1b', 'type1')],
  'id2': [('description2', 'type2')],
  ...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': (('description1a', 'description1b'), 'type1'),
  'id2': (('description2',), 'type2'),
  ...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
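For what it's worth, a hedged illustration of the struct idea; the field widths are arbitrary assumptions, not derived from your data:
import struct

# fixed-width record: 11-byte id, 60-byte description, 20-byte type
record_fmt = struct.Struct("11s60s20s")

packed = record_fmt.pack(b"IPI00110753", b"Tubulin alpha-1A chain", b"protein")
id_, description, id_type = (field.rstrip(b"\x00")
                             for field in record_fmt.unpack(packed))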
I'm assuming the problem you are trying to solve by cutting down on memory use is the address-space limit of your process. Additionally, you are searching for a data structure that allows fast insertion and reasonable sequential read-out.
Use fewer structures other than strings (str)
The question you ask is how to structure your data in one process to use less memory. The canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a B-tree implementation).
However, I think that you will not get very far with that approach, since what you face are huge datasets that might eventually exceed the process address space and the physical memory of the machine you're working with altogether.
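To put rough numbers on the container overhead, a hedged sketch (results vary by Python version, and getsizeof only counts the container itself, not the strings and tuples it references):
import sys

n = 100_000
ids = ["IPI%08d" % i for i in range(n)]

as_dict = {i: ("description", "type") for i in ids}
as_set = set(ids)
as_list = [(i, "description", "type") for i in ids]

for name, obj in [("dict", as_dict), ("set", as_set), ("list", as_list)]:
    print(name, sys.getsizeof(obj) / n, "bytes per entry (container only)")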
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interpret:
Split your information up across multiple processes, each holding whatever data structure is convenient for it.
Implement inter-process communication with sockets, so that processes can reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are that:
You get to use two or more cores of a machine fully for performance.
You are not limited by the address space of one process, or even by the physical memory of one machine.
There are numerous packages and approaches to distributed processing, some of which are:
linda
processing
If you're doing an n-way merge with duplicate removal, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge(*sources):
    key_pos = 0
    # sort each source so the smallest key is always at index 0
    for s in sources:
        s.sort()
    while any(len(s) > 0 for s in sources):
        # pick the source whose head item has the smallest key
        top_enum = enumerate([s[0][key_pos] if len(s) > 0 else None for s in sources])
        top = [t for t in top_enum if t[1] is not None]
        top.sort(key=lambda a: a[1])
        src, key = top[0]
        yield sources[src].pop(0)
This generator removes duplicates from a sequence.
def unique(sequence):
    key_pos = 0
    seq_iter = iter(sequence)
    curr = next(seq_iter)
    for item in seq_iter:
        if item[key_pos] == curr[key_pos]:
            # might want to collect the duplicate descriptions in a sub-list here
            continue
        yield curr
        curr = item
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique(merge(source1, source2, source3)):  # pass any number of sources
    print(u)
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using a {id: (description, id_type)} dictionary? Or a {(id, id_type): description} dictionary if (id, id_type) is the key.
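A minimal sketch of the first layout, with hypothetical sample records:
records = [
    ("IPI00110753", "Tubulin alpha-1A chain", "protein"),
    ("IPI00110753", "Alpha-tubulin 1", "protein"),
]

by_id = {}
for id_, description, id_type in records:
    # keep the first description seen for each id; later variants are ignored
    by_id.setdefault(id_, (description, id_type))

print(by_id)  # {'IPI00110753': ('Tubulin alpha-1A chain', 'protein')}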
Sets in Python are implemented using hash tables. In earlier versions they were actually implemented using dictionaries, but that has changed, AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
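Also note that for a set to actually discard entries with the same id but different descriptions, equality has to be based on the first element as well; a hedged sketch extending the class above, with hypothetical sample values:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])

    def __eq__(self, other):
        # compare on the id only, so differing descriptions count as duplicates
        return self[0] == other[0]

entries = {
    ProteinTuple("IPI00110753", "Tubulin alpha-1A chain", "protein"),
    ProteinTuple("IPI00110753", "Alpha-tubulin 1", "protein"),
}
print(len(entries))  # 1: the second tuple is treated as a duplicate of the first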
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type), ...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append(source2.pop(0))
    elif len(source2) == 0:
        result.append(source1.pop(0))
    elif source1[0][0] < source2[0][0]:
        result.append(source1.pop(0))
    elif source2[0][0] < source1[0][0]:
        result.append(source2.pop(0))
    else:
        # keys are equal: keep source1's entry and drop the duplicate from
        # source2 (check its description here if you care about differences)
        result.append(source1.pop(0))
        source2.pop(0)
This assembles a union of two lists by sorting and merging. No mapping, no hash.
