The same SPARQL query gives different results - Python

I read some questions related to mine, like
Same sparql not returning same results, but I think my case is a little different.
Consider this query, which I submit to http://live.dbpedia.org/sparql (Virtuoso endpoint) and which returns 34 triples:
SELECT ?pred ?obj
WHERE {
<http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
FILTER((langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN"))
)
}
Then I used the same query in Python code:
import rdflib
import rdfextras
rdfextras.registerplugins()
g=rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")
PREFIX = """
PREFIX dbp: <http://dbpedia.org/resource/>
"""
query = """
SELECT ?pred ?obj
WHERE {dbp:Johann_Sebastian_Bach ?pred ?obj
FILTER( (langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN")))}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)
This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt
I thought it could be related to the DBpedia site. I repeated these queries several times and always got the same difference. Therefore, I downloaded the RDF file to test it locally, and used Protégé in place of the Virtuoso endpoint. Even so, I still get different results from the SPARQL submitted in Protégé and in Python: 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?

As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.
lang, langMatches, and the empty string
lang is defined to return "" for literals with no language tag. According to RFC 4647 §2.1, a basic language range (the second argument to langMatches) is defined as follows:
2.1. Basic Language Range
A "basic language range" has the same syntax as an [RFC3066]
language tag or is the single character "*". The basic language
range was originally described by HTTP/1.1 [RFC2616] and later
[RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum = ALPHA / DIGIT
This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:
17.2 Filter Evaluation
SPARQL provides a subset of the functions and operators defined by
XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression
Processing describes the invocation of XPath functions. The following
rules accommodate the differences in the data and execution models
between XQuery and SPARQL: …
Functions invoked with an
argument of the wrong type will produce a type error. Effective
boolean value arguments (labeled "xsd:boolean (EBV)" in the operator
mapping table below), are coerced to xsd:boolean using the EBV rules
in section 17.2.2.
Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:
Basic filtering compares basic language ranges to language tags. Each
basic language range in the language priority list is considered in
turn, according to priority. A language range matches a particular
language tag if, in a case-insensitive comparison, it exactly equals
the tag, or if it exactly equals a prefix of the tag such that the
first character following the prefix is "-". For example, the
language-range "de-de" (German as used in Germany) matches the
language tag "de-DE-1996" (German as used in Germany, orthography of
1996), but not the language tags "de-Deva" (German as written in the
Devanagari script) or "de-Latn-DE" (German, Latin script, as used in
Germany).
Based on your comments and my local experiments, it appears that langMatches(lang(?obj),"") for literals without language tags (so really, langMatches("","")) is returning true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Protégé (from your experiments), and it's returning false (or an error that's coerced to false) in RDFlib.
In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj),"") with lang(?obj) = "".
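Concretely, here is a minimal sketch of the rdflib query from the question with just that one change to the filter (prefix, resource, and the Python 2 print statement are kept as in the question):
import rdflib
import rdfextras

rdfextras.registerplugins()

g = rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

# Untagged literals are matched with lang(?obj) = "" instead of langMatches.
query = """
PREFIX dbp: <http://dbpedia.org/resource/>
SELECT ?pred ?obj
WHERE {
  dbp:Johann_Sebastian_Bach ?pred ?obj
  FILTER( lang(?obj) = "" || langMatches(lang(?obj), "EN") )
}
"""

result_set = g.query(query)
print len(result_set)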
Issues with the data that you're using
You're not querying the same data. The data that you download from
http://dbpedia.org/resource/Johann_Sebastian_Bach
is from DBpedia, but when you run a query against
http://live.dbpedia.org/sparql,
you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:
SELECT count(*) WHERE {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN" ) )
}
DBpedia Live results: 31
DBpedia results: 34
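As a quick way to compare the two endpoints from Python, here is a sketch using the SPARQLWrapper library (not used in the question); the exact counts will drift over time as both datasets are updated:
from SPARQLWrapper import SPARQLWrapper, JSON

COUNT_QUERY = """
SELECT (count(*) AS ?n) WHERE {
  <http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
  FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}
"""

for endpoint in ("http://dbpedia.org/sparql", "http://live.dbpedia.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(COUNT_QUERY)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # Print the endpoint next to its triple count
    print endpoint, bindings[0]["n"]["value"]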
Issues with distinct
Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.
If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.
select ?pred ?obj where {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}

Related

How to improve query accuracy of Elasticsearch from Python?

How can you improve the accuracy search results from Elasticsearch using the Python wrapper? My basic example returns results, but the results are very inaccurate.
I'm running Elasticsearch 5.2 on Ubuntu 16, and I start by creating my index and adding a few documents like:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Document A
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some specific keywords',
        weight=1.0,
        data='blah1',
    ),
)

# Document B
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other specific keywords',
        weight=1.0,
        data='blah2',
    ),
)

# Document C
es.index(
    index='my-test-index',
    doc_type='text',
    body=dict(
        search_key='some other very long text that is very different yet mentions the word specific and keywords',
        weight=1.0,
        data='blah3',
    ),
)
I then query it with:
es = Elasticsearch()
es.indices.create(index='my-test-index', ignore=400)
query = 'some specific keywords'
results = es.search(
    index='my-test-index',
    body={
        'query': {
            "function_score": {
                "query": {
                    "match": {
                        "search_key": query
                    }
                },
                "functions": [{
                    "script_score": {
                        "script": "doc['weight'].value"
                    }
                }],
                "score_mode": "multiply"
            }
        },
    }
)
And although it returns all results, it returns them in the order of documents B, C, A, whereas I would expect them in the order A, B, C, because although all the documents contain all my keywords, only the first one is an exact match. I would expect C to be last because, even though it contains all my keywords, it also contains a lot of fluff I'm not explicitly searching for.
This problem compounds when I index more entries. The search returns everything that has even a single keyword from my query, and seemingly weights them all identically, causing the search results to get less and less accurate the larger my index grows.
This is making Elasticsearch almost useless. Is there any way I can fix it? Is there a problem with my search() call?
In your query, you can use a match_phrase query instead of a match query so that the order and proximity of the search terms get into the mix. Additionally, you can add a small slop in order to allow the terms to be further apart or in a different order, but documents with terms in the same order and closer together will be ranked higher than documents with terms out of order and/or further apart. Try it out:
"query": {
"match_phrase": {
"search_key": query,
"slop": 10
}
},
Note: slop is a number that indicates how many "swaps" of the search terms you need to perform in order to land on the term configuration present in the document.
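For completeness, here is roughly how that fragment might be slotted into the search() call from the question, keeping the function_score wrapper and the index/field names the question already uses (a sketch, not tested against your data):
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'function_score': {
                'query': {
                    # match_phrase instead of match; slop allows some reordering
                    'match_phrase': {
                        'search_key': {
                            'query': query,
                            'slop': 10,
                        },
                    },
                },
                'functions': [{
                    'script_score': {
                        'script': "doc['weight'].value",
                    },
                }],
                'score_mode': 'multiply',
            },
        },
    },
)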
Sorry for not reading your question more carefully and for the loaded answer below. I don't want to be a stick in the mud, but I think it will be clearer if you understand a bit more how Elasticsearch itself works.
Because you index your documents without specifying any index and mapping configuration, Elasticsearch will use several defaults that it provides out of the box. The indexing process will first tokenize field values in your document using the standard tokenizer and analyze them with the standard analyzer before storing them in the index. Both the standard tokenizer and analyzer work by splitting your string on word boundaries. So at the end of index time, what you have in the index for the search_key field are the terms ["some", "specific", "keywords"], not "some specific keywords".
During search time, the match query controls relevance using a similarity algorithm called term frequency/inverse document frequency, or TF/IDF. This algorithm is very popular in text search in general and there is a Wikipedia article on it: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. What's important to note here is that the more frequently a term appears in the index, the less important it is in terms of relevance. some, specific, and keywords appear in ALL 3 documents in your index, so as far as Elasticsearch is concerned, they contribute very little to a document's relevance in your search result. Since A contains only these terms, it's like having a document containing only the, an, a in an English index. It won't show up as the first result even if you search for the, an, a specifically. B ranks higher than C because B is shorter, which yields a higher norm value. This norm value is explained in the relevance documentation. This is a bit of speculation on my part, but I think it does work out this way if you inspect the query using the explain API.
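To check that speculation, the scoring breakdown can be inspected through the Python client's explain call; a small sketch (the document id below is hypothetical and needs to be replaced with a real _id returned by es.index()):
# Hypothetical id: substitute the _id that es.index() returned for document B.
explanation = es.explain(
    index='my-test-index',
    doc_type='text',
    id='REPLACE_WITH_REAL_ID',
    body={'query': {'match': {'search_key': 'some specific keywords'}}},
)
print(explanation['explanation'])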
So, back to your need, how to favor exact match over everything else? There is, of course, the match_phrase query as Val pointed out. Another popular method to do it, which I personally prefer, is to index the raw value in a nested field called search_key.raw using the not_analyzed option when defining your mapping: https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2 and simply boost this raw value when you search.
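As a sketch of that second approach: on the Elasticsearch 5.2 the question mentions, the string/not_analyzed combination from that guide page has been replaced by the keyword field type, so the raw value would live in a keyword sub-field. The mapping and the boosted query might look roughly like this (index, type, and field names are the ones from the question; the boost value is arbitrary):
# Define the mapping before indexing any documents: search_key is analyzed as
# usual, and search_key.raw keeps the untouched value (keyword plays the role
# of the old string/not_analyzed field on 5.x).
es.indices.create(
    index='my-test-index',
    body={
        'mappings': {
            'text': {
                'properties': {
                    'search_key': {
                        'type': 'text',
                        'fields': {
                            'raw': {'type': 'keyword'},
                        },
                    },
                },
            },
        },
    },
)

# At search time, boost exact matches on the raw sub-field.
results = es.search(
    index='my-test-index',
    body={
        'query': {
            'bool': {
                'must': {'match': {'search_key': query}},
                'should': {
                    'term': {'search_key.raw': {'value': query, 'boost': 10}},
                },
            },
        },
    },
)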

Find documents which contain a particular value - Mongo, Python

I'm trying to add a search option to my website but it doesn't work. I looked up solutions but they all refer to using an actual string, whereas in my case I'm using a variable, and I can't make those solutions work. Here is my code:
cursor = source.find({'title': search_term}).limit(25)
for document in cursor:
    result_list.append(document)
Unfortunately this only gives back results which match the search_term variable's value exactly. I want it to give back any results where the title contains the search term - regardless what other strings it contains. How can I do it if I want to pass a variable to it, and not an actual string? Thanks.
You can use $regex to do contains searches.
cursor = collection.find({'field': {'$regex':'regular expression'}})
And to make it case insensitive:
cursor = collection.find({'field': {'$regex':'regular expression', '$options':'i'}})
Please try cursor = source.find({'title': {'$regex':search_term}}).limit(25)
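Since search_term is a variable coming from user input, it's also worth escaping regex metacharacters before passing it to $regex; a small sketch using the variable and collection names from the question:
import re

# Escape metacharacters so input like "C++" is treated literally, then do a
# case-insensitive "contains" match on title.
pattern = re.escape(search_term)
cursor = source.find({'title': {'$regex': pattern, '$options': 'i'}}).limit(25)

result_list = []
for document in cursor:
    result_list.append(document)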
$text
You can perform a text search using $text and $search. You first need to create a text index, then use it:
$ db.docs.createIndex( { title: "text" } )
$ db.docs.find( { $text: { $search: "search_term" } } )
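If you're doing this from Python rather than the mongo shell, the pymongo equivalent (using the question's source collection and search_term variable; the index only needs to be created once) might look like:
import pymongo

# One-time setup: create the text index on title.
source.create_index([('title', pymongo.TEXT)])

# Then query it with $text / $search.
cursor = source.find({'$text': {'$search': search_term}}).limit(25)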
$regex
You may also use $regex, as answered here: https://stackoverflow.com/a/10616781/641627
$ db.users.findOne({"username" : {$regex : ".*son.*"}});
Both solutions compared
Full Text Search vs. Regular Expressions
... The regular expression search takes longer for queries with just a
few results while the full text search gets faster and is clearly
superior in those cases.

OrientDB metadata attributes

I'm trying OrientDB with Python. I have created a couple of vertices, and I noticed that if I prepend their property names with #, when I look them up in the Studio web app, those properties show up in the metadata section.
That's interesting, so I went and tried to query vertices, filtering by a metadata property id that I created.
When doing so:
select from V where #id = somevalue
I get a huge error message. I couldn't find a way of querying those custom metadata properties.
Your problem is with python/pyorient (I am presuming you are using pyorient based upon your other questions).
Pyorient attempts to map the database fields to object attributes, but python docs say the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. (source). Thus, you cannot use '#' as a field name and expect pyorient to work.
I went and took a quick peek at pyorient source code, and it turns out the problem is even bigger than above...
pyorient/types.py
elif key[0:1] == '#':
    # special case dict
    # { '#my_class': { 'accommodation': 'hotel' } }
    self.__o_class = key[1:]
    for _key, _value in content[key].items():
        self.__o_storage[_key] = _value
So pyorient presumes any record field that begins with an '#' character is followed by a class/cluster name, and then a dict of fields. I suppose you could post to the pyorient issue queue and suggest that the above elif section should check if content[key] is a dict, or a "simple value". If it were a dict, it should be handled like it currently is, otherwise it should be handled like a field, but have the # stripped from it.
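For illustration, a rough sketch of what that suggested change might look like, adapted from the snippet above (this is not actual pyorient code, just the shape of the check):
elif key[0:1] == '#':
    if isinstance(content[key], dict):
        # current behaviour: '#my_class' introduces a class name plus a dict
        # of fields, e.g. { '#my_class': { 'accommodation': 'hotel' } }
        self.__o_class = key[1:]
        for _key, _value in content[key].items():
            self.__o_storage[_key] = _value
    else:
        # proposed behaviour: treat '#some_field': value as an ordinary
        # field, stored with the leading '#' stripped
        self.__o_storage[key[1:]] = content[key]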
Ultimately, not using the # symbol in your field names will be the easiest solution.

Compare text to template to detect anomalies (reverse template)

I'm looking for an algorithm or even an algorithm space that deals with the problem of validating that short text (email) matches known templates. Coding will probably be python or perl, but that's flexible.
Here's the problem:
Servers with access to production data need to be able to send out email that will reach the Internet:
Dear John Smith,
We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
$12.34 Spuznitz, LLC on 4/1
$43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal.
Thank you for your business!
Obviously some of the email contents will vary - the salutation ("John Smith"), "$123.45 on 2/4/13", and the lines with transactions printed out. Other parts ("We received your last payment") are very static. I want to be able to match the static portions of the text and quantify that the dynamic portions are within certain reasonable limits (I might know that the maximum number of transaction lines to be printed is 5, for example).
Because I am concerned about data exfiltration, I want to make sure email that doesn't match this template never goes out - I want to examine email and quarantine anything that doesn't look like what I expect. So I need to automate this template matching and block any email messages that are far enough away from matching.
So the question is, where do I look for a filtering mechanism? Bayesian filtering tries to verify a sufficient similarity between a specific message and a non-specific corpus, which is kind of the opposite problem. Things like Perl's Template module are a tight match - but for output, not for input or comparison. Simple 'diff' type comparisons won't handle the limited dynamic info very well.
How do I test to see if these outgoing email messages "quack like a duck"?
You could use grammars for tight matching. It is possible to organize regexps in grammars for easier abstraction: http://www.effectiveperlprogramming.com/blog/1479
Or you could use a dedicated grammar engine, Marpa.
If you want a more statistical approach, consider n-grams. First, tokenize the text and replace variable chunks with meaningful placeholders, like CURRENCY and DATE. Then build the n-grams. Now you can use the Jaccard index to compare two texts.
Here is a Pure-Perl implementation which works on trigrams:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;
my $ngram1 = ngram(3, tokenize(<<'TEXT1'));
Dear John Smith,
We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
$12.34 Spuznitz, LLC on 4/1
$43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal.
Thank you for your business!
TEXT1
my $ngram2 = ngram(3, tokenize(<<'TEXT2'));
Dear Sally Bates,
We received your last payment for $456.78 on 6/9/12. We'd like you to be aware of the following charges:
$123,43 Gnomovision on 10/1
As always, you can view these transactions in our portal.
Thank you for your business!
TEXT2
my %intersection =
map { exists $ngram1->[2]{$_} ? ($_ => 1) : () }
keys %{$ngram2->[2]};
my %union =
map { $_ => 1 }
keys %{$ngram1->[2]}, keys %{$ngram2->[2]};
printf "Jaccard similarity coefficient: %0.3f\n", keys(%intersection) / keys(%union);
sub tokenize {
    my @words = split m{\s+}x, lc shift;
    for (@words) {
        s{\d{1,2}/\d{1,2}(?:/\d{2,4})?}{ DATE }gx;
        s{\d+(?:\,\d{3})*\.\d{1,2}}{ FLOAT }gx;
        s{\d+}{ INTEGER }gx;
        s{\$\s(?:FLOAT|INTEGER)\s}{ CURRENCY }gx;
        s{^\W+|\W+$}{}gx;
    }
    return @words;
}

sub ngram {
    my ($size, @words) = @_;
    --$size;
    my $ngram = [];
    for (my $j = 0; $j <= $#words; $j++) {
        my $k = $j + $size <= $#words ? $j + $size : $#words;
        for (my $l = $j; $l <= $k; $l++) {
            my @buf;
            for my $w (@words[$j..$l]) {
                push @buf, $w;
            }
            ++$ngram->[$#buf]{join(' ', @buf)};
        }
    }
    return $ngram;
}
You can use one text as a template and match it against your emails.
Check String::Trigram for an efficient implementation. Google Ngram Viewer is a nice resource to illustrate the n-gram matching.
If you want to match a pre-existing template with e.g. control flow elements like {% for x in y %} against a supposed output from it, you're going to have to parse the template language - which seems like a lot of work.
On the other hand, if you're prepared to write a second template for validation purposes - something like:
Dear {{customer}},
We received your last payment for {{currency}} on {{full-date}}\. We'd like you to be aware of the following charges:
( {{currency}} {{supplier}} on {{short-date}}
){,5}As always, you can view these transactions in our portal\.
... which is just a simple extension of regex syntax, it's pretty straightforward to hack something together that will validate against that:
import re

FIELDS = {
    "customer": r"[\w\s\.-]{,50}",
    "supplier": r"[\w\s\.,-]{,30}",
    "currency": r"[$€£]\d+\.\d{2}",
    "short-date": r"\d{,2}/\d{,2}",
    "full-date": r"\d{,2}/\d{,2}/\d{2}",
}

def validate(example, template_file):
    with open(template_file) as f:
        template = f.read()
    for tag, pattern in FIELDS.items():
        template = template.replace("{{%s}}" % tag, pattern)
    valid = re.compile(template + "$")
    return (re.match(valid, example) is not None)
The example above isn't the greatest Python code of all time by any means, but it's enough to get the general idea across.
I would go for the "longest common subsequence". A standard implementation can be found here:
http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence
If you need a better algorithm and/or lots of additional ideas for inexact matching of strings, THE standard reference is this book:
http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198
Don't be fooled by the title. Computational biology is mostly about matching against large databases of long strings (also known as DNA sequences).
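If you want to experiment quickly in Python before reaching for a dedicated library, difflib's SequenceMatcher (which uses a longest-matching-blocks heuristic related to, though not exactly, LCS) gives a rough similarity score between a known-good email line and a candidate:
import difflib

reference = "We received your last payment for $123.45 on 2/4/13."
candidate = "We received your last payment for $456.78 on 6/9/12."

# ratio() returns 2*M/T, where M is the number of matching characters and
# T is the combined length; values near 1.0 indicate a near-template match.
score = difflib.SequenceMatcher(None, reference, candidate).ratio()
print("similarity: %0.3f" % score)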

How do I use StandardAnalyzer with TermQuery?

I'm trying to produce something similar to what QueryParser in Lucene does, but without the parser, i.e. run a string through StandardAnalyzer, tokenize it, and use TermQuery objects in a BooleanQuery to produce a query. My problem is that I only get Tokens from StandardAnalyzer, not Terms. I can convert a Token to a Term by just extracting the string from it with Token.term(), but this is 2.4.x-only and it seems backwards, because I need to add the field a second time. What is the proper way of producing a TermQuery with StandardAnalyzer?
I'm using pylucene, but I guess the answer is the same for Java etc. Here is the code I've come up with:
from lucene import *

def term_match(self, phrase):
    query = BooleanQuery()
    sa = StandardAnalyzer()
    for token in sa.tokenStream("contents", StringReader(phrase)):
        term_query = TermQuery(Term("contents", token.term()))
        query.add(term_query, BooleanClause.Occur.SHOULD)
The established way to get the token text is with token.termText() - that API's been there forever.
And yes, you'll need to specify a field name to both the Analyzer and the Term; I think that's considered normal. 8-)
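Putting those two points together, a sketch of the method from the question using token.termText() instead of the 2.4.x-only term() might look like this (untested; field name as in the question, and it now returns the assembled query):
from lucene import *

def term_match(self, phrase):
    query = BooleanQuery()
    sa = StandardAnalyzer()
    # Each analyzed token becomes an optional (SHOULD) clause on "contents".
    for token in sa.tokenStream("contents", StringReader(phrase)):
        query.add(TermQuery(Term("contents", token.termText())),
                  BooleanClause.Occur.SHOULD)
    return query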
I've come across the same problem, and, using Lucene 2.9 API and Java, my code snippet looks like this:
final TokenStream tokenStream = new StandardAnalyzer(Version.LUCENE_29)
        .tokenStream( fieldName , new StringReader( value ) );
final List< String > result = new ArrayList< String >();
try {
    while ( tokenStream.incrementToken() ) {
        final TermAttribute term = ( TermAttribute ) tokenStream.getAttribute( TermAttribute.class );
        result.add( term.term() );
    }
} catch ( IOException e ) {
    // incrementToken() can throw IOException; handle or rethrow as needed
    throw new RuntimeException( e );
}
