Django, SQLite - Accurate ordering of strings with accented letters - python

Main problem:
I have a Python (3.4) Django (1.6) web app using an SQLite (3) database containing a table of authors. When I get the ordered list of authors, some names with accented characters like 'Čapek' and 'Örkény' end up at the end of the list instead of at (or directly after) the 'c' and 'o' sections.
My 1st try:
SQLite can accept collation definitions. I searched for one made to order UTF-8 strings correctly, for example Localized and Unicode collation in Android (Accented Search in sqlite (android)), but found none.
My 2nd try: I found an old, closed Django ticket about my problem: https://code.djangoproject.com/ticket/8384 It suggests sorting with Python as a workaround, which I found quite unsatisfying. Firstly, if I sort with a Python method (like the one below) instead of ordering at the model level, I cannot use generic views. Secondly, ordering with a Python method returns the very same result as SQLite's order_by does: 'Čapek' and 'Örkény' are placed after the 'z' section.
author_list = sorted(Author.objects.all(), key=lambda x: (x.lastname, x.firstname))
How could I get the queryset ordered correctly?

Thanks to the link CL posted in his comment, I managed to overcome the difficulties I mentioned in my replies. I am answering my own question to share the piece of code that worked, because using pyuca to sort querysets seems to be a rare and undocumented case.
# import section
from pyuca import Collator
# Calling Collator() takes a few seconds, so create it once and reuse it.
c = Collator()
# ...
# main part:
author_list = sorted(Author.objects.all(), key=lambda x: (c.sort_key(x.lastname), c.sort_key(x.firstname)))
The point is to use the sort_key method with the attribute you want to sort by as its argument. As the example shows, you can sort by multiple attributes.
Last words: in my language (Hungarian) we use four accented variants of the Latin letter 'o': 'o', 'ó', 'ö', 'ő'. For sorting, 'o' and 'ó' are equal, 'ö' and 'ő' are equal, and 'ö'/'ő' come after 'o'/'ó'. In the default collation table all four letters are equal, so now I am trying to find (or define) a localized collation table.

You could create a new field in the table, fill it with the result of unidecode, then sort according to it.
Using a property to provide get/set methods could help in keeping the fields in sync.
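For illustration, here is a minimal sketch of that idea, assuming a simple Author model with a lastname field (the field names are invented, and the sort column is kept in sync in save() rather than via the property-based getters/setters mentioned above):
from unidecode import unidecode
from django.db import models

class Author(models.Model):
    lastname = models.CharField(max_length=100)
    # ASCII-only shadow column used purely for ordering
    lastname_sort = models.CharField(max_length=100, editable=False, db_index=True)

    def save(self, *args, **kwargs):
        # keep the sort column in sync with the display column
        self.lastname_sort = unidecode(self.lastname).lower()
        super(Author, self).save(*args, **kwargs)

    class Meta:
        ordering = ['lastname_sort']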

Related

OrientDB metadata attributes

I'm trying OrientDB with Python. I have created a couple of vertices, and I noticed that if I prepend their property names with #, those properties show up in the metadata section when I look at them in the Studio web app.
That's interesting, so I went and tried to query vertices filtering by a metadata property id that I created.
When doing so:
select from V where #id = somevalue
I get a huge error message. I couldn't find a way of querying those custom metadata properties.
Your problem is with Python/pyorient (I am presuming you are using pyorient based on your other questions).
Pyorient attempts to map the database fields to object attributes, but the Python docs say the valid characters for identifiers are "the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9" (source). Thus, you cannot use '#' in a field name and expect pyorient to work.
I went and took a quick peek at the pyorient source code, and it turns out the problem is even bigger than that...
pyorient/types.py
elif key[0:1] == '#':
    # special case dict
    # { '#my_class': { 'accommodation': 'hotel' } }
    self.__o_class = key[1:]
    for _key, _value in content[key].items():
        self.__o_storage[_key] = _value
So pyorient presumes that any record field beginning with a '#' character names a class/cluster and is followed by a dict of fields. I suppose you could post to the pyorient issue queue and suggest that the elif section above check whether content[key] is a dict or a "simple value": if it is a dict, it should be handled as it currently is; otherwise it should be handled like an ordinary field, but with the # stripped from its name.
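For illustration only, that suggested check might look something like this inside the same elif branch (a sketch of the idea, not an actual pyorient patch):
elif key[0:1] == '#':
    if isinstance(content[key], dict):
        # current behaviour: '#my_class' wraps a dict of fields
        self.__o_class = key[1:]
        for _key, _value in content[key].items():
            self.__o_storage[_key] = _value
    else:
        # simple value: treat '#id' as a plain field named 'id'
        self.__o_storage[key[1:]] = content[key]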
Ultimately, not using the # symbol in your field names will be the easiest solution.

'if' element is not in list on Google App Engine

I am building an application for Facebook using Google App Engine. I was trying to compare the friends in my user's Facebook account with those already in my application, so that I could add them to the database if they are friends on Facebook but not yet in my application, and skip them if they are already friends in both. I was trying something like this:
request = graph.request("/me/friends")
user = User.get_by_key_name(self.session.id)
list = []
for x in user.friends:
    list.append(x.user)
for friend in request["data"]:
    if User.get_by_key_name(friend["id"]):
        friendt = User.get_by_key_name(friend["id"])
        if friendt.key not in user.friends:
            newfriend = Friend(friend=user,
                               user=friendt,
                               id=friendt.id)
            newfriend.put()
graph.request returns an object with the user's friends. How do I compare the contents of the two lists of retrieved objects? It doesn't necessarily need to be Facebook related.
(I know this question may be quite silly, but it is really being a pain for me.)
If you upgrade to NDB, the "in" operator will actually work; NDB implements a proper eq operator on Model instances. Note that the key is also compared, so entities that have the same property values but different keys are considered unequal. If you want to ignore the key, consider comparing e1._to_dict() == e2._to_dict().
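A small sketch of what that looks like (the User model and values below are made up for the example):
from google.appengine.ext import ndb

class User(ndb.Model):
    name = ndb.StringProperty()

a = User(id='42', name='rik')
b = User(id='42', name='rik')

a == b                        # True: same key and same property values
a in [b]                      # works, because ndb.Model defines equality
a._to_dict() == b._to_dict()  # compares property values only, ignoring the key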
You should write a custom function to compare your objects, treating it as a comparison of nested dictionaries. As you will be comparing only the attributes and not functions, you have to do a nested dict comparison.
Reason: none of the attributes will be callable and (hopefully) none will start with _, so you just have to compare the remaining elements of obj.__dict__, and the approach should be bottom-up, i.e. finish the nested objects first (the main object could host other objects, which will have their own __dict__).
Lastly, you can consider the accepted answer code here: How to compare two lists of dicts in Python?
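In the same spirit as that accepted answer, a rough sketch (the names and values here are just examples) is to turn each dict into a hashable, order-independent key and compare sets:
def as_key(d):
    # tuple of sorted items is hashable and ignores key order
    return tuple(sorted(d.items()))

facebook_friends = [{'id': '1', 'name': 'rik'}, {'id': '2', 'name': 'atee'}]
app_friends = [{'id': '2', 'name': 'atee'}]

missing = [d for d in facebook_friends
           if as_key(d) not in {as_key(x) for x in app_friends}]
# missing == [{'id': '1', 'name': 'rik'}]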

django printing tuple-keyed dictionary

I have a dictionary with tuples as keys, e.g. {('tags','1'): 'name', ('name','first'): 'rik', ('name','last'): 'atee'}.
In a Django template, how do I print the value at ('name','first'), for example? I can do it with dict.items.1 or dict.items.2, but then ordering becomes an issue.
The Django templating language is intentionally restrictive in what it allows you to reference.
For instance, you must use the dot operator to access both attributes and dictionary elements, which means the keys you reference must be strings.
From docs:
Variable names must consist of any letter (A-Z), any digit (0-9), an underscore (but they must not start with an underscore) or a dot.
https://docs.djangoproject.com/en/dev/ref/templates/api/#variables-and-lookups
Your options are to either (a) use your view to munge the tuple keys into a string format (see the sketch below) or (b) use a different templating engine which allows you to reference arbitrary keys.
Option (b) isn't actually as bad as it sounds, as there are templating languages for Django which are designed as a superset of Django's templating language, so (theoretically) all your old templates keep working and you just get more functionality. I advise you to check out the Jinja2 templating language; it has the functionality you're looking for.
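A quick sketch of option (a), done in the view (the variable names are just examples): collapse the tuple keys into nested string-keyed dicts that the template's dot lookup can reach.
data = {('tags', '1'): 'name', ('name', 'first'): 'rik', ('name', 'last'): 'atee'}

munged = {}
for (outer, inner), value in data.items():
    munged.setdefault(outer, {})[inner] = value
# munged == {'tags': {'1': 'name'}, 'name': {'first': 'rik', 'last': 'atee'}}
# and in the template: {{ munged.name.first }}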
Any value in the dictionary can be accessed by its corresponding key.
So...
>>> foo = {('tags','1'): 'name', ('name','first'):'rik', ('name','last'):'atee'}
>>> foo[('name','first')]
'rik'
However, tuples-as-keys will probably just end up being confusing and error-prone; why not use something like {"tags": ['name'], "first": 'rik', "last": 'atee'}?

Django-Haystack with Solr contains search

I am using Haystack within a project, with Solr as the backend. I want to be able to perform a contains search, similar to Django's .filter(something__contains="...").
The __startswith option does not suit our needs since, as the name suggests, it looks for words that start with the string.
I tried to use something like *keyword*, but Solr does not allow the * to be used as the first character.
Thanks.
To get "contains" functionallity you can use:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="back"/>
<filter class="solr.LowerCaseFilterFactory" />
as index analyzer.
This will create n-grams for every whitespace-separated word in your field. For example:
"Index this!" => x, ex, dex, ndex, index, !, s!, is!, his!, this!
As you can see, this expands your index greatly, but if you now enter a query like:
"nde*"
it will match "ndex", giving you a hit.
Use this approach carefully to make sure that your index doesn't get too large. If you increase minGramSize or decrease maxGramSize, the index will not expand as much, but the "contains" functionality is reduced accordingly. For instance, setting minGramSize="3" requires at least 3 characters in your contains query.
You can achieve the same behavior without having to touch the Solr schema. In your index, make your text field an EdgeNgramField instead of a CharField. Under the hood this will generate a schema similar to what lindstromhenrik suggested.
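A minimal sketch of such an index (the model and field names are invented for the example):
from haystack import indexes
from myapp.models import Note  # hypothetical model

class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    # EdgeNgramField instead of CharField gives prefix-style matching
    text = indexes.EdgeNgramField(document=True, use_template=True)

    def get_model(self):
        return Note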
I am using an expression like:
.filter(something__startswith='...')
.filter_or(name=''+s'...')
as it seems Solr does not like expressions like '...*', but combined with an OR it does the job.
None of the answers here do a real substring search (*keyword*).
They don't find a keyword that is part of a bigger string, i.e. one that is neither a prefix nor a suffix.
Using EdgeNGramFilterFactory or the EdgeNgramField in the indexes can only give you a "startswith" or an "endswith" type of filtering.
The solution is to use a NgramField like this:
class MyIndex(indexes.SearchIndex, indexes.Indexable):
    ...
    field_to_index = indexes.NgramField(model_attr='field_name')
    ...
This is very elegant, because you don't need to manually add anything to the schema.xml

Python: Need to replace a series of different substrings in HTML template with additional HTML or database results

Situation:
I am writing a basic templating system in Python/mod_python that reads in a main HTML template, replaces instances of ":value:" throughout the document with additional HTML or database results, and then returns it as a view to the user.
I am not trying to replace all instances of one fixed substring; the values vary, but there is a finite list of acceptable ones. The syntax for the values is [colon]value[colon], for example ":gallery:", ":related:", ":comments:". The replacement may be additional static HTML or a call to a function, and the functions may vary as well.
Question:
What's the most efficient way to read in the main HTML file and replace the unknown combination of values with their appropriate replacement?
Thanks in advance for any thoughts/solutions,
c
There are dozens of templating options that already exist. Consider genshi, mako, jinja2, django templates, or more.
You'll find that you're reinventing the wheel with little/no benefit.
If you can't use an existing templating system for whatever reason, your problem seems best tackled with regular expressions:
import re

valre = re.compile(r':\w+:')

def dosub(correspvals, correspfuns, lastditch):
    def f(value):
        v = value.group()[1:-1]
        if v in correspvals:
            return correspvals[v]
        if v in correspfuns:
            return correspfuns[v]()  # or whatever args you need
        # what if a value has neither a corresponding value to
        # substitute, NOR a function to call? Whatever...:
        return lastditch(v)
    return f

replacer = dosub(adict, another, somefun)
thehtml = valre.sub(replacer, thehtml)
Basically you'll need two dictionaries (one mapping values to corresponding values, another mapping values to corresponding functions to be called) and a function to be called as a last-ditch attempt for values that can't be found in either dictionary; the code above shows you how to put these things together (I'm using a closure, a class would of course do just as well) and how to apply them for the required replacement task.
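Just to make the wiring concrete, here is one hypothetical way to fill in adict, another and somefun (all the values below are invented examples):
adict = {'gallery': '<div class="gallery">...</div>'}          # static HTML replacements
another = {'comments': lambda: '<ul class="comments"></ul>'}   # functions returning HTML
somefun = lambda name: ''                                       # unknown :value: -> drop it

replacer = dosub(adict, another, somefun)
print(valre.sub(replacer, '<p>:gallery:</p> <p>:comments:</p> <p>:unknown:</p>'))
# -> <p><div class="gallery">...</div></p> <p><ul class="comments"></ul></p> <p></p>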
This is probably a job for a templating engine, and for Python there are a number of choices. In this Stack Overflow question people have listed their favourites and some helpfully explain why: What is your single favorite Python templating engine?
