Find documents which contain a particular value - Mongo, Python

I'm trying to add a search option to my website but it doesn't work. I looked up solutions but they all refer to using an actual string, whereas in my case I'm using a variable, and I can't make those solutions work. Here is my code:
cursor = source.find({'title': search_term}).limit(25)
for document in cursor:
    result_list.append(document)
Unfortunately this only gives back results that match the search_term variable's value exactly. I want it to return any results where the title contains the search term, regardless of what other strings it contains. How can I do this when I'm passing a variable, and not an actual string? Thanks.

You can use $regex to do contains searches.
cursor = collection.find({'field': {'$regex':'regular expression'}})
And to make it case insensitive:
cursor = collection.find({'field': {'$regex': 'regular expression', '$options': 'i'}})
Please try cursor = source.find({'title': {'$regex':search_term}}).limit(25)
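If the user-supplied term may contain regex metacharacters, a small sketch of the same idea with the term escaped and the match made case-insensitive (assuming the source collection and search_term variable from your question):
import re

pattern = re.escape(search_term)  # treat the user's input as literal text
cursor = source.find({'title': {'$regex': pattern, '$options': 'i'}}).limit(25)
result_list = [document for document in cursor]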

$text
You can perform a text search using $text and $search. You first need to create a text index, then use it:
db.docs.createIndex( { title: "text" } )
db.docs.find( { $text: { $search: "search_term" } } )
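From Python, a minimal pymongo sketch of the same thing with a variable (the database and collection names here are assumptions):
from pymongo import MongoClient, TEXT

docs = MongoClient().mydb.docs
docs.create_index([("title", TEXT)])  # one-time text index on the title field

cursor = docs.find({"$text": {"$search": search_term}}).limit(25)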
$regex
You may also use $regex, as answered here: https://stackoverflow.com/a/10616781/641627
db.users.findOne({"username" : {$regex : ".*son.*"}});
Both solutions compared
Full Text Search vs. Regular Expressions
... The regular expression search takes longer for queries with just a
few results while the full text search gets faster and is clearly
superior in those cases.

Related

Write a custom JSON interpreter for a file that looks like JSON but isn't, using Python

What I need to do is write a module that can read and write files that use the PDX script language. This language looks a lot like JSON but has enough differences that a custom encoder/decoder is needed to do anything with those files (without a mess of regex substitutions which would make maintenance hell). I originally went with just reading them as txt files and using regex to find and replace things to convert them to valid JSON. This led me to my current point, where any addition to the code requires me to write far more code than I would want to, just to support some small new thing. With a custom JSON-like interpreter I could write code that describes the valid key:value pairs, then use that to handle the files. To me that will be a lot less code and a lot easier to maintain.
So what does this code look like? In general it looks like this (I tried to include all possible syntax; this is not an example of a working file):
#key = value # this is the definition for the scripted variable
key = {
    # This is a comment. No multiline comments
    function # This is a single key, usually optimize_memory
    # These are the accepted key:value pairs. The quoted version is being phased out
    key = "value"
    key = value
    key = #key # This key is using a scripted variable, defined either in the file it's in or in the `scripted_variables` folder. (see above for an example of how these are initially defined)
    # type is what the key type is, like trigger:planet_stability where planet_stability is a trigger
    key = type:key
    # Variables like this allow custom names to be set. Mostly used for flags and such things
    [[VARIABLE_NAME]
        math_key = $VARIABLE_NAME$
    ]
    # This is inline math. I don't actually understand how this works in the script language yet as it's new. The "<" can be replaced with any math symbol.
    # Valid example: planet_stability < #[ stabilitylevel2 + 10 ]
    key < #[ key + 10 ]
    # This is used a lot to handle code blocks. Valid example:
    # potential = {
    #     exists = owner
    #     owner = {
    #         has_country_flag = flag_name
    #     }
    # }
    key = {
        key = value
    }
    # This is just a list. Inline brackets are used a lot, which annoys me...
    key = { value value }
}
The major differences between JSON and PDX script are the nearly complete lack of quotation marks, the use of an equals sign instead of a colon for separation, and no commas at the end of lines. Now before you ask me to change the PDX code: I can't. It's not mine. This is what I have to work with and I can't make any changes to the syntax. And no, I don't want to convert back and forth, as I have already mentioned this would require a lot of work. I have attempted to look for examples of this, however all I can find are references for converting already valid JSON to a Python object, which is not what I want. So I can't give any examples of what I have already done, as I can't find anywhere to even start.
Some additional info:
Order of key:value pairs does not technically matter, however it is expected to be in a certain order, and when it is not in that order it causes issues with mods and conflict solvers
bool properties always use yes or no rather than true or false
Lowercase is expected and in some cases required
Math operators are used as separators as well, e.g. >=, <=, etc.
The list of syntax is not exhaustive, but should contain most of the syntax used in the language
Past work:
My last attempts at this all revolved around converting it from a text file to a JSON file. This was a lot of work just to get a small piece of this to work.
Example:
potential = {
    exists = owner
    owner = {
        is_regular_empire = yes
        is_fallen_empire = no
    }
    NOR = {
        has_modifier = resort_colony
        has_modifier = slave_colony
        uses_habitat_capitals = yes
    }
}
And what I did to get most of the way to JSON (I couldn't find a way to add quotes):
test_string = test_string.replace("\n", ",")
test_string = test_string.replace("{,", "{")
test_string = test_string.replace("{", "{\n")
test_string = test_string.replace(",", ",\n")
test_string = test_string.replace("}, ", "},\n")
test_string = "{\n" + test_string + "\n}"
# Replace the equals sign with a colon
test_string = test_string.replace(" =", ":")
This resulted in this:
{
    potential: {
        exists: owner,
        owner: {
            is_regular_empire: yes,
            is_fallen_empire: no,
        },
        NOR: {
            has_modifier: resort_colony,
            has_modifier: slave_colony,
            uses_habitat_capitals: yes,
        },
    }
}
Very very close, yes, but in no way could I find a way to add the quotations to each word (I think I did try a regex sub, but wasn't able to get it to work, since this whole thing is just one unbroken string), leaving this attempt stuck and also showing just how much work is required to get a very simple potential block to mostly work. However this is not the method I want anymore, one because it's a lot of work and two because I couldn't find anything to finish it. So a custom JSON interpreter is what I want.
The classical approach (potentially leading to more code, but also more "correctness"/elegance) is probably to build a recursive descent parser: a bunch of conditionals/checks, loops and (sometimes recursive) functions/handlers that deal with each of the elements/characters encountered on the input stream. An implicit parse/call tree might be sufficient if you directly output/print the JSON equivalent, or otherwise you could also build a representation/model in memory for later output/conversion.
A related book recommendation would be "Language Implementation Patterns" by Terence Parr, since I'd rather avoid promoting my own interpreters and introductory materials :-) In case you need further help, maybe write me?
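For the assignment/block subset shown in the question, a minimal recursive-descent sketch in Python could look like the following. It is only a starting point: it drops everything after a # (so scripted-variable references like #key are not handled), and it ignores the [[VARIABLE]] and inline-math forms.
import re

# Tokens: a comment to end of line, a brace, or any other run of characters
# that isn't whitespace, a brace or '#' (keys, values, =, <=, "quoted", type:key).
TOKEN_RE = re.compile(r'#[^\n]*|[{}]|[^\s{}#]+')
OPERATORS = {'=', '<', '>', '<=', '>='}

def tokenize(text):
    # Drop comments; note this also drops scripted-variable references (#key).
    return [t for t in TOKEN_RE.findall(text) if not t.startswith('#')]

def parse_block(tokens, pos):
    # Parse until a closing '}' (or end of input) into a list of
    # (key, operator, value) triples; a value is a string, a nested block
    # (another list of triples), or None for a bare word in an inline list.
    items = []
    while pos < len(tokens) and tokens[pos] != '}':
        key = tokens[pos]
        if pos + 1 < len(tokens) and tokens[pos + 1] in OPERATORS:
            op = tokens[pos + 1]
            pos += 2
            if pos < len(tokens) and tokens[pos] == '{':
                value, pos = parse_block(tokens, pos + 1)
                pos += 1  # step over the closing '}'
            else:
                value = tokens[pos]
                pos += 1
            items.append((key, op, value))
        else:
            items.append((key, None, None))  # bare key, e.g. `function`, or a list member
            pos += 1
    return items, pos

def parse(text):
    items, _ = parse_block(tokenize(text), 0)
    return items

print(parse("potential = { exists = owner owner = { is_regular_empire = yes } }"))
# [('potential', '=', [('exists', '=', 'owner'), ('owner', '=', [('is_regular_empire', '=', 'yes')])])]
Writing files back out is then just a walk over the same (key, operator, value) structure, emitting braces and equals signs instead of JSON punctuation.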

str.format a list by joining its values

Say I have a dictionary:
data = {
    "user" : {
        "properties" : ["i1", "i2"]
    }
}
And the following string:
txt = "The user has properties {user[properties]}"
I want to have:
txt.format(**data)
to equal:
The user has properties i1, i2
I believe to achieve this, I could subclass the formatter used by str.format but I am unfortunately unsure how to proceed. I rarely subclass standard Python classes. Note that writing {user[properties][0]}, {user[properties][1]} is not an ideal option for me here. I don't know how many items are in the list so I would need to do a regex to identify matches, then find the relevant value in data and replace the matched text with {user[properties][0]}, {user[properties][1]}. str.format takes care of all the indexing from the string's value so it is very practical.
Just join the items in data["user"]["properties"]:
txt = "The user has properties {properties}"
txt.format(properties = ", ".join(data["user"]["properties"]))
Here you have a live example
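If you do want the subclassing route mentioned in the question, here is a minimal sketch with string.Formatter; joining every list or tuple field with ", " is an assumption about the behaviour you want:
from string import Formatter

class JoinFormatter(Formatter):
    # Render list/tuple fields as comma-separated strings instead of their repr.
    def format_field(self, value, format_spec):
        if isinstance(value, (list, tuple)):
            return ", ".join(str(v) for v in value)
        return super().format_field(value, format_spec)

data = {"user": {"properties": ["i1", "i2"]}}
txt = "The user has properties {user[properties]}"
print(JoinFormatter().vformat(txt, (), data))
# The user has properties i1, i2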
I ended up using the jinja2 package for all of my formatting needs. It's extremely powerful and I really recommend it!

Zapier: Code not returning all values expected

I am working with Code by Zapier and having trouble telling whether my regex is wrong or some other part is (I think the latter).
I am pulling a URL; this URL has several 9-digit IDs appended to the end of it. I was told to try and extract these IDs and rebuild the URLs so we can post API calls for each of them.
I am a Python newb, but so far I have this, and it only returns the first 9-digit ID. I am hoping for an array so I can rebuild the URLs with each specific ID. This is my code so far:
import re
urls = re.findall(r'\b\d+\b', input_data['input'])
return [
    {'url': url} for url in urls
]
The input_data would be "https://api.helpscout.net/v1/conversations/123456789,098765432.json".
As I said, it just returns the first ID. I know I don't have the code to rebuild the URLs or anything; I'm just trying to take it one step at a time!
Is my regex incorrect, or the way I am returning them? Thanks!
David here, from the Zapier Platform team. I've got good news and bad news.
Good news is, your regex works! So no sweat there. Downside, you're running into a weird corner case with code steps. It's not documented because we don't encourage its use (it's confusing, as you can tell). When you return an array from a code step, it functions like a trigger. That is, subsequent steps run for each item in the array (but the UI only shows the first).
If that's the desired behavior, you can safely ignore the weird test and complete the zap. If you don't want to fan out, you should instead parse out the comma separated string and act on it later.
If you need more direction, let me know a bit about what your other actions are and I can advise from there.
Side note, the reason you're seeing the error message with Willem's function above is that your python code must either set the output variable to a value or return an object. Either of return get_digits(input_data['input']) or output = get_digits(input_data['input']) should work.
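If you go the no-fan-out route, a minimal sketch of a code step that sets output to a single object (the urls field name and the rebuilt URL template are assumptions based on the sample input):
import re

# Collect the nine-digit IDs and rebuild one URL per ID, then return them
# as a single comma-separated string so the step does not fan out.
ids = re.findall(r'\b\d{9}\b', input_data['input'])
urls = ['https://api.helpscout.net/v1/conversations/{}.json'.format(i) for i in ids]
output = {'urls': ', '.join(urls)}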
The code works correctly on my machine:
import re
def get_digits(s):
    return [{'url': url} for url in re.findall(r'\b\d+\b', s)]
If I then call this with the sample input, I get:
>>> get_digits("https://api.helpscout.net/v1/conversations/123456789,098765432.json")
[{'url': '123456789'}, {'url': '098765432'}]
So a list with two dictionaries. Each dictionary contains one key, a 'url' that is associated with the string that contains one or more digits.
In case you want to match only nine-digit sequences, you can make the regex more restrictive (but this can only decrease the number of matches):
import re
def get_digits(s):
    return [{'url': url} for url in re.findall(r'\b\d{9}\b', s)]
Argh so frustrating, I decided to try JavaScript and multiple methods don't output anything
<script>
var str = "https://api.helpscout.net/v1/conversations/382411278,374879346,374879343.json";
var tickets = str.match(/\d{9}/g);
for (var i = 0; i < tickets.length; i++)
{
    document.write("https://api.helpscout.net/v1/conversations/" + tickets[i] + ".json</br>")
}
</script>
or
<p id="demo"></p>
<script>
function myFunction() {
    var str = "https://api.helpscout.net/v1/conversations/382411278,374879346,374879343.json";
    var tickets = str.match(/\d{9}/g);
    for (var i = 0; i < tickets.length; i++)
    {
        document.getElementById("demo").innerHTML += "https://api.helpscout.net/v1/conversations/" + tickets[i] + "<br />"
    }
}
</script>

The same query in SPARQL gives different results

I read some questions related to mine, like Same sparql not returning same results, but I think my case is a little different.
Consider this query, which I submit to http://live.dbpedia.org/sparql (the Virtuoso endpoint) and get 34 triples as a result:
SELECT ?pred ?obj
WHERE {
<http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
FILTER((langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN"))
)
}
Then, I used the same query in a code in python:
import rdflib
import rdfextras
rdfextras.registerplugins()
g=rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")
PREFIX = """
PREFIX dbp: <http://dbpedia.org/resource/>
"""
query = """
SELECT ?pred ?obj
WHERE {dbp:Johann_Sebastian_Bach ?pred ?obj
FILTER( (langMatches(lang(?obj), "")) ||
(langMatches(lang(?obj), "EN")))}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)
This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt
I thought it could be related to the DBpedia site. I repeated these queries several times and always got the same difference. Therefore, I downloaded the RDF file to test it locally, and used the software Protégé to simulate the Virtuoso endpoint. Even so, I still get different results from the SPARQL submitted through Protégé and through Python: 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?
As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.
lang, langMatches, and the empty string
lang is defined to return "" for literals with no language tags. According to RFC 4647 §2.1, language tags are defined as follows:
2.1. Basic Language Range
A "basic language range" has the same syntax as an [RFC3066]
language tag or is the single character "*". The basic language
range was originally described by HTTP/1.1 [RFC2616] and later
[RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum = ALPHA / DIGIT
This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:
17.2 Filter Evaluation
SPARQL provides a subset of the functions and operators defined by
XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression
Processing describes the invocation of XPath functions. The following
rules accommodate the differences in the data and execution models
between XQuery and SPARQL: …
Functions invoked with an
argument of the wrong type will produce a type error. Effective
boolean value arguments (labeled "xsd:boolean (EBV)" in the operator
mapping table below), are coerced to xsd:boolean using the EBV rules
in section 17.2.2.
Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:
Basic filtering compares basic language ranges to language tags. Each
basic language range in the language priority list is considered in
turn, according to priority. A language range matches a particular
language tag if, in a case-insensitive comparison, it exactly equals
the tag, or if it exactly equals a prefix of the tag such that the
first character following the prefix is "-". For example, the
language-range "de-de" (German as used in Germany) matches the
language tag "de-DE-1996" (German as used in Germany, orthography of
1996), but not the language tags "de-Deva" (German as written in the
Devanagari script) or "de-Latn-DE" (German, Latin script, as used in
Germany).
Based on your comments and my local experiments, it appears that langMatches(lang(?obj),"") for literals without language tags (so really, langMatches("","")) is returning true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Protégé (from our experiments), and it's returning false (or an error that's coerced to false) in RDFlib.
In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj),"") with lang(?obj) = "".
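As a quick check, a minimal rdflib sketch with that substitution (recent rdflib versions run SPARQL natively, so rdfextras isn't needed):
import rdflib

g = rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

query = """
PREFIX dbp: <http://dbpedia.org/resource/>
SELECT ?pred ?obj
WHERE { dbp:Johann_Sebastian_Bach ?pred ?obj
        # lang(?obj) = "" keeps untagged literals; langMatches handles "EN"
        FILTER( lang(?obj) = "" || langMatches(lang(?obj), "EN") ) }
"""
print(len(g.query(query)))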
Issues with the data that you're using
You're not querying the same data. The data that you download from
http://dbpedia.org/resource/Johann_Sebastian_Bach
is from DBpedia, but when you run a query against
http://live.dbpedia.org/sparql,
you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:
SELECT count(*) WHERE {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN" ) )
}
DBpedia Live results: 31
DBpedia results: 34
Issues with distinct
Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.
If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.
select ?pred ?obj where {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}
SPARQL results

How do I use StandardAnalyzer with TermQuery?

I'm trying to produce something similar to what QueryParser in Lucene does, but without the parser, i.e. run a string through StandardAnalyzer, tokenize it, and use TermQuery:s in a BooleanQuery to produce a query. My problem is that I only get Token:s from StandardAnalyzer, and not Term:s. I can convert a Token to a Term by just extracting the string from it with Token.term(), but this is 2.4.x-only and it seems backwards, because I need to add the field a second time. What is the proper way of producing a TermQuery with StandardAnalyzer?
I'm using pylucene, but I guess the answer is the same for Java etc. Here is the code I've come up with:
from lucene import *
def term_match(self, phrase):
    query = BooleanQuery()
    sa = StandardAnalyzer()
    for token in sa.tokenStream("contents", StringReader(phrase)):
        term_query = TermQuery(Term("contents", token.term()))
        query.add(term_query, BooleanClause.Occur.SHOULD)
The established way to get the token text is with token.termText() - that API's been there forever.
And yes, you'll need to specify a field name to both the Analyzer and the Term; I think that's considered normal. 8-)
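Putting that together, a sketch of the function from the question using token.termText() (same pylucene 2.x-era API and the "contents" field name from the question):
from lucene import *

def term_match(phrase):
    # One SHOULD TermQuery per analyzed token, using termText() as suggested above.
    query = BooleanQuery()
    sa = StandardAnalyzer()
    for token in sa.tokenStream("contents", StringReader(phrase)):
        query.add(TermQuery(Term("contents", token.termText())), BooleanClause.Occur.SHOULD)
    return query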
I've come across the same problem, and, using Lucene 2.9 API and Java, my code snippet looks like this:
final TokenStream tokenStream = new StandardAnalyzer(Version.LUCENE_29)
        .tokenStream( fieldName , new StringReader( value ) );
final List< String > result = new ArrayList< String >();
try {
    while ( tokenStream.incrementToken() ) {
        final TermAttribute term = ( TermAttribute ) tokenStream.getAttribute( TermAttribute.class );
        result.add( term.term() );
    }
} catch ( IOException e ) {
    // incrementToken() can throw IOException; handle it as appropriate
}
