Implementing "Starts with" and "Ends with" queries with Google App Engine - python

Am wondering if anyone can provide some guidance on how I might implement a starts with or ends with query against a Datastore model using Python?
In pseudo code, it would work something like...
Query for all entities A where property P starts with X
or
Query for all entities B where property P ends with X
Thanks, Matt

You can do a 'starts with' query by using inequality filters:
MyModel.all().filter('prop >=', prefix).filter('prop <', prefix + u'\ufffd')
Doing an 'ends with' query would require storing the reverse of the string, then applying the same tactic as above.

Seems you can't do it for the general case, but can do it for prefix searches (starts with):
Wildcard search on Appengine in python

Related

Python parsing update statements using regex

I'm trying to find a regex expression in python that will be able to handle most of the UPDATE queries that I throw at if from my DB. I can't use sqlparse or any other libraries that may be useful with for this, I can only use python's built-in modules or cx_Oracle, in case it has a method I'm not aware of that could do something like this.
Most update queries look like this:
UPDATE TABLE_NAME SET COLUMN_NAME=2, OTHER_COLUMN=to_date('31-DEC-202023:59:59','DD-MON-YYYYHH24:MI:SS'), COLUMN_STRING='Hello, thanks for your help', UPDATED_BY=-100 WHERE CODE=9999;
Most update queries I use have a version of these types of updates. The output has to be a list including each separate SQL keyword (UPDATE, SET, WHERE), each separate update statement(i.e COLUMN_NAME=2) and the final identifier (CODE=9999).
Ideally, the result would look something like this:
list = ['UPDATE', 'TABLE_NAME', 'SET', 'COLUMN_NAME=2', 'OTHER_COLUMN=("31-DEC-2020 23:59:59","DD-MON-YYYY HH24:MI:SS")', COLUMN_STRING='Hello, thanks for your help', 'UPDATED_BY=-100', 'WHERE', 'CODE=9999']
Initially I tried doing this using a string.split() splitting on the spaces, but when dealing with one of my slightly more complex queries like the one above, the split method doesn't deal well with string updates such as the one I'm trying to make in COLUMN_STRING or those in OTHER_COLUMN due to the blank spaces in those updates.
Let's use the shlex module :
import shlex
test="UPDATE TABLE_NAME SET COLUMN_NAME=2, OTHER_COLUMN=to_date('31-DEC-202023:59:59','DD-MON-YYYYHH24:MI:SS'), COLUMN_STRING='Hello, thanks for your help', UPDATED_BY=-100 WHERE CODE=9999;"
t=shlex.split(test)
Up to here, we won't get rid of comma delimiters and the last semi one, so maybe we can do this :
for i in t:
if i[-1] in [',',';']:
i=i[:-1]
If we print every element of that list we'll get :
UPDATE
TABLE_NAME
SET
COLUMN_NAME=2
OTHER_COLUMN=to_date(31-DEC-202023:59:59,DD-MON-YYYYHH24:MI:SS)
COLUMN_STRING=Hello, thanks for your help
UPDATED_BY=-100
WHERE
CODE=9999
Not a proper generic answer, but serves the purpose i hope.

Django domain + regex parameter not working on production machine

I currently have a django view with a fairly simple search function (takes user input, returns a list of objects). For usability, I'd like the option of passing search paramters via url like so:
www.example.com/search/mysearchstring
Where mysearchstring is the input to the search function. I'm using regex to validate any alphanumeric or underscore characters.
The problem I'm having is that while this works perfectly in my development environment, it breaks on the live machine.
Currently, I am using this exact same method (with different regex patterns) in other django views without any issues. This leads me to believe that either.
1) My regex is truly bad (more likely)
2) There is a difference in regex validators between environments (less likely)
The machine running this is using django 1.6 and python 2.7, which are slightly behind my development machine, but not significantly.
urls.py
SEARCH_REGEX = '(?P<pdom>\w*)?'
urlpatterns = patterns('',
....
url(r'^polls/search/' + SEARCH_REGEX, 'polls.views.search'),
...)
Which are passed to the view like this
views. py
def search(request, pdom):
...
When loading up the page, I get the following error:
ImproperlyConfigured: "^polls/search/(?P<pdom>\w*)?" is not a valid regular expression: nothing to repeat
I've been scratching my head over this one for a while. I've attempted to use a few different methods of encapsulation around the expression with no change in results. Would appreciate any insight!
I would change it to this:
SEARCH_REGEX = r'(?P<pdom>.+)$'
It's usually a good idea to use raw strings r'' for regular expressions in python.
The group will match the entire content of the search part of your url. I would handle query string validation in the view, instead of in the url regex. If someone tries to search polls/search/two+words, you should not return a 404, but instead a 400 status and a error message explaining that the search string was malformed.
Finally, you might want to follow the common convention for search urls. Which is to use a query parameter called q. So your url-pattern would be ^polls/search/$, and then you just handle the q in the view using something like this:
def search_page_view(request):
query_string = request.GET.get('q', '')

Efficient Django full-text search without Haystack

What's the next best option for database-agnostic full-text search for Django without Haystack?
I have a model like:
class Paper(models.Model):
title = models.CharField(max_length=1000)
class Person(models.Model):
name = models.CharField(max_length=100)
class PaperReview(models.Model):
paper = models.ForeignKey(Paper)
person = models.ForeignKey(Person)
I need to search for papers by title and reviewer name, but I also want to search from the perspective of a person and find which papers they have and haven't reviewed. With Haystack, it's trivial to implement a full-text index to search by title and name fields, but as far as I can tell, there's no way to do the "left outer join" necessary to find papers without a review by a specific person.
Haystack is just a wrapper that exposes a few different search engine backends:
Solr
ElasticSearch
Whoosh
Xapian
There might be other backends as well available as plugins.
So the real question here is, is there a search backend that gives me the desired functionality, and does haystack expose that functionality?
The answer to that is, you can probably use elasticsearch*, but note the asterix.
Generally, when creating a search index, it's a good idea to think about the documents in the same way you might if you were creating a no-rel database and you want those documents to be as flat as possible.
So one possibility might be to have an array of char fields on a paperreview index. The array would contain all of the related foreign key references.
Another might be to use "nested documents" in elasticsearch.
And lastly, to use "parent/child documents" in elasticsearch.
You can still use haystack for indexing, with some hacking, but you will probably want to use one of the raw backends directly, such as pyelasticsearch or pyes.
http://www.elasticsearch.org/guide/reference/mapping/nested-type/
http://www.elasticsearch.org/guide/reference/mapping/parent-field/
http://pyelasticsearch.readthedocs.org/en/latest/
http://pyes.readthedocs.org/en/latest/
I know this question is older, but I spent some time investigation this recently and answered this as well here but it is actually not too hard to implement this yourself, and wanted to share.
I found the SearchVector/SearchQuery approach actually does not catch all cases, for example partial words (see https://www.fusionbox.com/blog/detail/partial-word-search-with-postgres-full-text-search-in-django/632/ for reference). You can implement your own without much trouble, depending on your constraints.
example, within a viewsets' get_queryset method:
...other params...
search_terms = self.request.GET.get('q')
if search_terms:
# remove possible other delimiters and other chars
# that could interfere
cleaned_terms = re.sub(r'[!\'()|&;,]', ' ', search_terms).strip()
if cleaned_terms:
# Check against all the params we want
# apply to previous terms' filtered results
q = reduce(
lambda p, n: p & n,
map(
lambda word:
Q(your_property__icontains=word) | Q(
second_property__icontains=word) | Q(
third_property__icontains=word)
cleaned_terms.split()
)
)
qs = YourModel.objects.filter(q)
return qs
I use Haystack + elastic search and so far its working pretty well. Dont think its trivial . You can easily implement your requirement, if theres a association between paper and person.
I ended up using djorm-ext-pgfulltext, which provides a simple Django interface for PostgreSQL's built-in full text search features.

Django-Haystack with Solr contains search

I am using haystack within a project using solr as the backend. I want to be able to perform a contains search, similar to the Django .filter(something__contains="...")
The __startswith option does not suit our needs as it, as the name suggests, looks for words that start with the string.
I tried to use something like *keyword* but Solr does not allow the * to be used as the first character
Thanks.
To get "contains" functionallity you can use:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="back"/>
<filter class="solr.LowerCaseFilterFactory" />
as index analyzer.
This will create ngrams for every whitespace separated word in your field. For example:
"Index this!" => x, ex, dex, ndex, index, !, s!, is!, his!, this!
As you see this will expand your index greatly but if you now enter a query like:
"nde*"
it will match "ndex" giving you a hit.
Use this approach carefully to make sure that your index doesn't get too large. If you increase minGramSize, or decrease maxGramSize it will not expand the index as mutch but reduce the "contains" functionallity. For instance setting minGramSize="3" will require that you have at least 3 characters in your contains query.
You can achieve the same behavior without having to touch the solr schema. In your index, make your text field an EdgeNgramField instead of a CharField. Under the hood this will generate a similar schema to what lindstromhenrik suggested.
I am using an expression like:
.filter(something__startswith='...')
.filter_or(name=''+s'...')
as is seems solr does not like expression like '...*', but combined with or will do
None of the answers here do a real substring search *keyword*.
They don't find the keyword that is part of a bigger string, (not a prefix or suffix).
Using EdgeNGramFilterFactory or the EdgeNgramField in the indexes can only do a "startswith" or a "endswith" type of filtering.
The solution is to use a NgramField like this:
class MyIndex(indexes.SearchIndex, indexes.Indexable):
...
field_to_index= indexes.NgramField(model_attr='field_name')
...
This is very elegant, because you don't need to manually add anything to the schema.xml

QuerySet: LEFT JOIN with AND

I use old Django version 1.1 with hack, that support join in extra(). It works, but now is time for changes. Django 1.2 use RawQuerySet so I've rewritten my code for that solution. Problem is, that RawQuery doesn't support filters etc. which I have many in code.
Digging through Google, on CaktusGroup I've found, that I could use query.join().
It would be great, but in code I have:
LEFT OUTER JOIN "core_rating" ON
("core_film"."parent_id" = "core_rating"."parent_id"
AND "core_rating"."user_id" = %i
In query.join() I've written first part "core_film"."parent_id" = "core_rating"."parent_id" but I don't know how to add the second part after AND.
Does there exist any solution for Django, that I could use custom JOINs without rewritting all the filters code (Raw)?
This is our current fragment of code in extra()
top_films = top_films.extra(
select=dict(guess_rating='core_rating.guess_rating_alg1'),
join=['LEFT OUTER JOIN "core_rating" ON ("core_film"."parent_id" = "core_rating"."parent_id" and "core_rating"."user_id" = %i)' % user_id] + extra_join,
where=['core_film.parent_id in (select parent_id from core_film EXCEPT select film_id from filmbasket_basketitem where "wishlist" IS NOT NULL and user_id=%i)' % user_id,
'( ("core_rating"."type"=1 AND "core_rating"."rating" IS NULL) OR "core_rating"."user_id" IS NULL)',
' "core_rating"."last_displayed" IS NULL'],
)
Unfortunately, the answer here is no.
The Django ORM, like most of Django, follows a philosophy that easy things should be easy and hard things should be possible. In this case, you are definitely in the "hard things" area and the "possible" solution is to simply write the raw query. There are definitely situations like this where writing the raw query can be difficult and feels kinda gross, but from the project's perspective situations like this are too rare to justify the cost of adding such functionality.
Try this patch: https://code.djangoproject.com/ticket/7231

Categories