I'm trying OrientDB with Python. I have created a couple of vertices, and I noticed that if I prepend their property names with @, those properties show up in the metadata section when I search for them in the Studio web app.
That's interesting, so I went and tried to query vertices filtering by a metadata property id that I created.
When doing so:
select from V where @id = somevalue
I get a huge error message. I couldn't find a way of querying those custom metadata properties.
Your problem is with Python/pyorient (I am presuming you are using pyorient, based upon your other questions).
Pyorient attempts to map the database fields to object attributes, but the Python docs say the valid characters for identifiers are "the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9" (source). Thus, you cannot use '@' as a field name and expect pyorient to work.
I went and took a quick peek at the pyorient source code, and it turns out the problem is even bigger than above...
pyorient/types.py
elif key[0:1] == '@':
    # special case dict
    # { '@my_class': { 'accommodation': 'hotel' } }
    self.__o_class = key[1:]
    for _key, _value in content[key].items():
        self.__o_storage[_key] = _value
So pyorient presumes any record field that begins with an '@' character is followed by a class/cluster name and then a dict of fields. I suppose you could post to the pyorient issue queue and suggest that the above elif section check whether content[key] is a dict or a "simple value". If it were a dict, it should be handled as it currently is; otherwise, it should be handled like a field, but with the @ stripped from it.
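For illustration, a minimal sketch of what that suggested check might look like (this is hypothetical, not a tested patch against pyorient):

elif key[0:1] == '@':
    if isinstance(content[key], dict):
        # a dict: treat it as the { '@my_class': {...fields...} } special case
        self.__o_class = key[1:]
        for _key, _value in content[key].items():
            self.__o_storage[_key] = _value
    else:
        # a simple value: treat '@field' as an ordinary field, stripping the '@'
        self.__o_storage[key[1:]] = content[key]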
Ultimately, not using the @ symbol in your field names will be the easiest solution.
What I need to do is write a module that can read and write files that use the PDX script language. This language looks a lot like JSON, but has enough differences that a custom encoder/decoder is needed to do anything with those files (without a mess of regex substitutions, which would make maintenance hell). I originally went with just reading them as txt files and using regex to find and replace things to convert them to valid JSON. This led me to my current point, where any addition to the code requires me to write far more code than I would want to, just to support some small new thing. So using a custom JSON-style encoder/decoder, I could write code that shows what the valid key:value pairs are, then use that to handle the files. To me that will be a lot less code and a lot easier to maintain.
So what does this code look like? In general it looks like this (I tried to put in all possible syntax; this is not an example of a working file):
@key = value # this is the definition for the scripted variable
key = {
    # This is a comment. No multiline comments
    function # This is a single key, usually optimize_memory
    # These are the accepted key:value pairs. The quoted version is being phased out
    key = "value"
    key = value
    key = @key # This key is using a scripted variable, defined either in the file it's in or in the `scripted_variables` folder. (see above for an example of how these are initially defined)
    # type is what the key type is. Like trigger:planet_stability where planet_stability is a trigger
    key = type:key
    # Variables like this allow for custom names to be set. Mostly used for flags and such things
    [[VARIABLE_NAME]
        math_key = $VARIABLE_NAME$
    ]
    # This is inline math; I don't actually understand how this works in the script language yet, as it's new. The "<" can be replaced with any math symbol.
    # Valid example: planet_stability < @[ stabilitylevel2 + 10 ]
    key < @[ key + 10 ]
    # This is used a lot to handle code blocks. Valid example:
    # potential = {
    #     exists = owner
    #     owner = {
    #         has_country_flag = flag_name
    #     }
    # }
    key = {
        key = value
    }
    # This is just a list. Inline brackets are used a lot, which annoys me...
    key = { value value }
}
The major differences between JSON and PDX script are the nearly complete lack of quotation marks, an equals sign instead of a colon for separation, and no commas at the ends of lines. Now, before you ask me to change the PDX code: I can't. It's not mine. This is what I have to work with, and I can't make any changes to the syntax. And no, I don't want to convert back and forth, as I have already mentioned that would require a lot of work. I have attempted to look for examples of this; however, all I can find are references to converting already valid JSON to a Python object, which is not what I want. So I can't give any examples of what I have already done, as I can't find anywhere to even start.
Some additional info:
Order of key:value pairs does not technically matter; however, it is expected to be in a certain order, and when it is not in that order it causes issues with mods and conflict solvers
Bool properties always use yes or no rather than true or false
Lowercase is expected and in some cases required
Math operators are used as separators as well, e.g. >=, <=, etc.
The list of syntax is not exhaustive, but should contain most of the syntax used in the language
Past work:
My last attempts at this all revolved around converting it from a text file to a JSON file. This was a lot of work just to get a small piece of this to work.
Example:
potential = {
    exists = owner
    owner = {
        is_regular_empire = yes
        is_fallen_empire = no
    }
    NOR = {
        has_modifier = resort_colony
        has_modifier = slave_colony
        uses_habitat_capitals = yes
    }
}
And here is what I did to get most of the way to JSON (I couldn't find a way to add quotes):
test_string = test_string.replace("\n", ",")
test_string = test_string.replace("{,", "{")
test_string = test_string.replace("{", "{\n")
test_string = test_string.replace(",", ",\n")
test_string = test_string.replace("}, ", "},\n")
test_string = "{\n" + test_string + "\n}"
# Replace the equals sign with a colon
test_string = test_string.replace(" =", ":")
This resulted in this:
{
    potential: {
        exists: owner,
        owner: {
            is_regular_empire: yes,
            is_fallen_empire: no,
        },
        NOR: {
            has_modifier: resort_colony,
            has_modifier: slave_colony,
            uses_habitat_capitals: yes,
        },
    }
}
Very, very close, yes, but in no way could I find a way to add the quotations to each word (I think I did try a regex sub, but wasn't able to get it to work, since this whole thing is just one unbroken string), leaving this attempt stuck and also showing just how much work is required to get a very simple potential block to mostly work. However, this is not the method I want anymore: one, because it's a lot of work, and two, because I couldn't find anything to finish it. So a custom JSON-style interpreter is what I want.
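(For reference, the kind of quoting step I was missing might have looked something like the regex substitution below. This is a hypothetical sketch I never got working end to end, and it naively quotes every bare word, including things like yes/no:)

import re

# wrap every bare identifier in double quotes; braces, colons and commas are left alone
test_string = re.sub(r"([A-Za-z_][A-Za-z0-9_]*)", r'"\1"', test_string)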
The classical approach (potentially leading to more code, but also more "correctness"/elegance) is probably to build a "recursive descent parser" from a bunch of conditionals/checks, loops, and (sometimes recursive) functions/handlers that deal with each of the encountered elements/characters on the input stream. An implicit parse/call tree might be sufficient if you directly output/print the JSON equivalent, or otherwise you could also create a representation/model in memory for later output/conversion.
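To make that concrete, here is a minimal sketch of such a recursive descent parser for a simplified subset of the syntax above (comments, key = value pairs, nested { } blocks, and bare words); names like parse_block are illustrative, and it deliberately ignores the scripted-variable, inline-math, and operator forms:

import re

# one token per comment, brace, '=' or bare word
TOKEN_RE = re.compile(r'#[^\n]*|[{}=]|[^\s{}=]+')

def tokenize(text):
    # drop comments, keep everything else
    return [t for t in TOKEN_RE.findall(text) if not t.startswith('#')]

def parse_block(tokens, pos):
    """Parse tokens until '}' or end of input; return (items, new_pos)."""
    items = []  # list of pairs, because duplicate keys are legal here
    while pos < len(tokens) and tokens[pos] != '}':
        key = tokens[pos]
        if pos + 1 < len(tokens) and tokens[pos + 1] == '=':
            pos += 2
            if tokens[pos] == '{':
                value, pos = parse_block(tokens, pos + 1)
                pos += 1  # step over the closing '}'
            else:
                value = tokens[pos]
                pos += 1
            items.append((key, value))
        else:
            items.append((key, None))  # bare word: single key or list element
            pos += 1
    return items, pos

def parse(text):
    items, _ = parse_block(tokenize(text), 0)
    return items

print(parse('potential = { exists = owner owner = { is_fallen_empire = no } }'))

Note the result is a list of (key, value) pairs rather than a dict, since blocks like the NOR example above can repeat the same key (has_modifier) twice.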
A related book recommendation would be "Language Implementation Patterns" by Terence Parr (I'm avoiding promoting my own interpreters and introductory materials :-) ). In case you need further help, feel free to write to me.
Main problem:
I have a Python (3.4) Django (1.6) web app using an SQLite (3) database containing a table of authors. When I get the ordered list of authors, some names with accented characters like 'Čapek' and 'Örkény' are at the end of the list instead of at (or directly after) section 'c' and 'o' of the list.
My 1st try:
SQLite can accept collation definitions. I searched for one made to order UTF-8 strings correctly, for example in "Localized and Unicode collation in Android" (Accented Search in sqlite (android)), but found none.
My 2nd try: I found an old closed Django ticket about my problem: https://code.djangoproject.com/ticket/8384 It suggests sorting with Python as a workaround. I found it quite unsatisfying. Firstly, if I sort with a Python method (like below) instead of ordering at the model level, I cannot use generic views. Secondly, ordering with a Python method returns the very same result as the SQLite order_by does: 'Čapek' and 'Örkény' are placed after section 'z'.
author_list = sorted(Author.objects.all(), key=lambda x: (x.lastname, x.firstname))
How could I get the queryset ordered correctly?
Thanks to the link CL wrote in his comment, I managed to overcome the difficulties I replied about. I am answering my own question to share the piece of code that worked, because using pyuca to sort querysets seems to be a rare and undocumented case.
# import section
from pyuca import Collator

# Calling Collator() takes some seconds, so you should create it once as a reusable variable.
c = Collator()

# ...
# main part:
author_list = sorted(Author.objects.all(),
                     key=lambda x: (c.sort_key(x.lastname), c.sort_key(x.firstname)))
The point is to use the sort_key method with the attribute you want to sort by as its argument. You can sort by multiple attributes, as you see in the example.
Last words: In my language (Hungarian) we use four different accented versions of the Latin letter 'o': 'o', 'ó', 'ö', 'ő'. 'o' and 'ó' are equal in sorting, 'ö' and 'ő' are equal too, and 'ö'/'ő' come after 'o'/'ó'. In the default collation table the four letters are all equal. Now I am trying to find a way to define or find a localized collation table.
You could create a new field in the table, fill it with the result of unidecode, then sort according to it.
Using a property to provide get/set methods could help in keeping the fields in sync.
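A minimal sketch of that idea, assuming the Author model from the question and the unidecode package (lastname_sort is a hypothetical extra column):

from django.db import models
from unidecode import unidecode

class Author(models.Model):
    firstname = models.CharField(max_length=100)
    # the actual value lives in _lastname; lastname_sort holds its ASCII form
    _lastname = models.CharField(max_length=100, db_column='lastname')
    lastname_sort = models.CharField(max_length=100, editable=False)

    @property
    def lastname(self):
        return self._lastname

    @lastname.setter
    def lastname(self, value):
        self._lastname = value
        self.lastname_sort = unidecode(value)  # keep the sort key in sync

# ordering can then stay at the queryset level, so generic views keep working:
# Author.objects.order_by('lastname_sort', 'firstname')

Note that unidecode folds all four Hungarian 'o' variants to plain 'o', so this fixes the end-of-list problem but not the finer 'ö'/'ő' ordering.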
I'm working through a book called "Head First Programming," and there's a particular part where I'm confused as to why they're doing this.
There doesn't appear to be any reasoning for it, nor any explanation anywhere in the text.
The issue in question is using multiple assignment to assign split data from a string into a hash (why they're using a hash doesn't make sense to me, but that's a separate issue). Here's the example code:
line = "101;Johnny 'wave-boy' Jones;USA;8.32;Fish;21"
s = {}
(s['id'], s['name'], s['country'], s['average'], s['board'], s['age']) = line.split(";")
I understand that this will take the string line and split it up into each named part, but I don't understand why what I think are keys are being named using a string, when just a few pages prior they were named like any other variable, without single quotes.
The purpose of the individual parts is to be searched based on an individual element and then printed on screen. For example, being able to search by ID number and then return the entire thing.
The language in question is Python, if that makes any difference. This is rather confusing for me, since I'm trying to learn this stuff on my own.
My personal best guess is that it doesn't make any difference and that it was personal preference on the part of the authors, but it bewilders me that they would suddenly change form like that without it having any meaning, and it further bothers me that they don't explain it.
EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way. Therefore, I'd have to assume it's a matter of personal preference, but I still would like some info from someone who actually knows what they're doing as to whether it actually makes a difference, in the long run.
EDIT 2: Apparently it doesn't make sense how my Python interpreter is actually working with what I've given it, so I made a screen capture of it working: https://www.youtube.com/watch?v=52GQJEeSwUA
I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes
The answer is right there. If there are no quotes, as in mydict[s], then s is a variable, and you look up the key in the dict based on the value of s.
If it's a string, then you look up literally that key.
So, in your example, s[name] won't work, as that would try to access the variable name, which is probably not set.
EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way.
That's just pure luck... There's a built-in function called id:
>>> id
<built-in function id>
Try another name, and you'll see that it won't work.
Actually, as it turns out, for dictionaries (Python's term for hashes) there is a semantic difference between having the quotes there and not.
For example:
s = {}
s['test'] = 1
s['othertest'] = 2
defines a dictionary called s with two keys, 'test' and 'othertest'. However, if I tried to do this instead:
s = {}
s[test] = 1
I'd get a NameError exception, because this would be looking for an undefined variable called test whose value would be used as the key.
If, then, I were to type this into the Python interpreter:
>>> s = {}
>>> s['test'] = 1
>>> s['othertest'] = 2
>>> test = 'othertest'
>>> print s[test]
2
>>> print s['test']
1
you'll see that using test as a key with no quotes uses the value of that variable to look up the associated entry in the dictionary s.
Edit: Now, the REALLY interesting question is why using s[id] gave you what you expected. id is not a keyword, but a built-in function in Python that gives you a unique id for an object passed as its argument. What in the world the Python interpreter is doing with the expression s[id] is a total mystery to me.
Edit 2: Watching the OP's YouTube video, it's clear that he's staying consistent between assigning and reading the hash, using either id or 'id' throughout, so there's no issue with the function id as a hash key somehow magically lining up with 'id' as a hash key. That had me kind of worried for a while.
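To make the distinction concrete, a quick interpreter experiment (safe to try yourself) shows that the built-in function object id and the string 'id' are simply two different, equally valid dictionary keys:

>>> s = {}
>>> s[id] = 1      # keyed by the built-in function object (functions are hashable)
>>> s['id'] = 2    # keyed by the string
>>> s[id], s['id']
(1, 2)

So s[id] "works" without a NameError only because id happens to be defined as a built-in; it refers to a different key than s['id'].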
This method iterates over a list of terms in the database, checks whether each term appears in the text passed as an argument, and, if one does, replaces it with a link to the search page with the term as a parameter.
The number of terms is high (about 100,000), so the process is pretty slow, but this is OK since it is performed as a cron job. However, it causes the script's memory consumption to skyrocket and I can't find why:
import re

from django.db import models


class SearchedTerm(models.Model):

    [...]

    @classmethod
    def add_search_links_to_text(cls, string, count=3, queryset=None):
        """
        Take a list of all researched terms and search for them in the
        text. If they exist, turn them into links to the search
        page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites got different URL schemas, we don't
        provide direct links, but we inject the {% url %} tag,
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they got different namespaces as
        well, we enter a generic 'namespace' and delegate to the
        template to change it to the proper one.

        If you have a batch process to do, you can pass a queryset
        that will be used instead of getting all searched terms on
        each call.
        """
        found = 0
        terms = queryset or cls.on_site.all()
        # to avoid duplicate searched terms being replaced twice,
        # keep a list of already linkified content;
        # seed it with the words we are going to insert with the link,
        # so they won't match in case of multiple passes
        processed = set((u'video', u'streaming', u'title',
                         u'search', u'namespace', u'href', u'title',
                         u'url'))
        for term in terms:
            text = term.text.lower()
            # skip small words, and make a quick check
            # to avoid all the rest of the matching
            if len(text) < 3 or text not in string:
                continue
            if found and cls._is_processed(text, processed):
                continue
            # match the search word with accents, in any case;
            # ensure this is not part of a word by including a
            # 'non-letter' character (or boundary) on both ends of the word
            pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text,
                                 re.UNICODE | re.IGNORECASE)
            if re.search(pattern, string):
                found += 1
                # create the link string and replace the word in the description;
                # use back references (\1, \2, etc.) to preserve the original
                # formatting;
                # use raw unicode strings (ur"string" notation) to avoid
                # problems with accents and escaping
                query = '-'.join(term.text.split())
                url = ur'{%% url namespace:static-search "%s" %%}' % query
                replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url
                string = re.sub(pattern, replace_with, string)
                processed.add(text)
                if found >= 3:
                    break
        return string
You'll probably want this code as well:
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def _is_processed(cls, text, processed):
        """
        Check if the text is part of an already processed string.

        We don't use `in` on the set, but `in` on each string of the set,
        to avoid substring matches that would destroy the tags.

        This is mainly a utility function, so you probably won't use
        it directly.
        """
        if text in processed:
            return True
        return any((text in string) for string in processed)
I really have only two objects with references that could be the suspects here: terms and processed. But I can't see any reason for them not to be garbage collected.
EDIT:
I think I should say that this method is called inside a Django model method itself. I don't know if it's relevant, but here is the code:
class Video(models.Model):

    [...]

    def update_html_description(self, links=3, queryset=None):
        """
        Take a list of all researched terms and search for them in the
        description. If they exist, turn them into links to the search
        engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and has, therefore, the same
        limitations.

        It DOESN'T call save().
        """
        queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
        text = self.description or self.title
        self.html_description = SearchedTerm.add_search_links_to_text(text,
                                                                      links,
                                                                      queryset)
I can imagine that the automatic Python regex caching eats up some memory. But it should do so only once, while the memory consumption goes up at every call of update_html_description.
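If the regex cache is a suspect, it can at least be ruled out cheaply, since the standard library exposes a way to clear it (this is just a diagnostic step, not a fix I have verified):

import re

re.purge()  # clears the compiled-pattern cache kept by the re module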
The problem is not just that it consumes a lot of memory; the problem is that it does not release it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with 'cannot allocate memory'.
The whole queryset is loaded into memory once you call it; that is what eats up your memory. You want to get chunks of results if the resultset is that large; it might mean more hits on the database, but it will also mean a lot less memory consumption.
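A minimal sketch of that chunking idea, assuming the SearchedTerm model from the question (the helper name and chunk size are illustrative):

def iter_in_chunks(queryset, chunk_size=1000):
    # walk the queryset by primary key, so only one chunk
    # of model instances is held in memory at a time
    last_pk = 0
    while True:
        batch = list(queryset.filter(pk__gt=last_pk).order_by('pk')[:chunk_size])
        if not batch:
            break
        for obj in batch:
            yield obj
        last_pk = batch[-1].pk

# inside add_search_links_to_text, instead of iterating `terms` directly:
# for term in iter_in_chunks(cls.on_site.all()): ...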
I was completely unable to find the cause of the problem, but for now I'm bypassing it by isolating the infamous snippet in a separate script (called using subprocess) that contains this method call. The memory goes up, but of course goes back to normal after the Python process dies.
Talk about dirty.
But that's all I got for now.
Make sure that you aren't running in DEBUG: when DEBUG is True, Django keeps a log of every database query it runs, which grows for the lifetime of the process.
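If DEBUG has to stay on for some reason, that accumulated query log can also be cleared by hand between batches; a one-line sketch using Django's standard helper:

from django import db

db.reset_queries()  # drops the per-connection query log that DEBUG accumulates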
I think I should say that this method is called inside a Django model method itself.
@classmethod
Why? Why is this "class level"?
Why aren't these ordinary methods that can have ordinary scope rules and -- in the normal course of events -- get garbage collected?
In other words (in the form of an answer):
Get rid of @classmethod.
I'm trying to load JSON back into an object. The loads method seems to work without error, but the object doesn't seem to have the properties I expect.
How can I go about examining/inspecting the object that I have? (This is web-based code.)
results = {"Subscriber": {"firstname": "Neal", "lastname": "Walters"}}
subscriber = json.loads(results)
for item in inspect.getmembers(subscriber):
    self.response.out.write("<BR>Item")
    for subitem in item:
        self.response.out.write("<BR> SubItem=" + subitem)
The attempt above returned this:
Item
SubItem=__class__
I don't think it matters, but for context:
The JSON is actually coming from a urlfetch in Google App Engine to
a rest web service created using this utility:
http://code.google.com/p/appengine-rest-server.
The data is being retrieved from a datastore with this definition:
class Subscriber(db.Model):
    firstname = db.StringProperty()
    lastname = db.StringProperty()
Thanks,
Neal
Update #1: Basically I'm trying to deserialize JSON back into an object.
In theory it was serialized from an object, and I want to now get it back into an object.
Maybe the better question is how to do that?
Update #2: I was trying to abstract a complex program down to a few lines of code, so I made a few mistakes in "pseudo-coding" it for purposes of posting here.
Here's a better code sample, now taken out of the website so I can run it on my PC.
results = '{"Subscriber": {"firstname": "Neal", "lastname": "Walters"}}'
subscriber = json.loads(results)
for key, value in subscriber.items():
    print " %s: %s" % (key, value)
The above runs, but what it displays doesn't look any more structured than the JSON string itself. It displays this:
Subscriber: {u'lastname': u'Walters', u'firstname': u'Neal'}
I have more of a Microsoft background, so when I hear serialize/deserialize, I think of going from an object to a string and from a string back to an object. So if I serialize to JSON and then deserialize, what do I get: a dictionary, a list, or an object? Actually, I'm getting the JSON from a REST web method that is serializing my object on my behalf.
Ideally I want a subscriber object that matches my Subscriber class above, and ideally I don't want to write one-off custom code (i.e. code that would be specific to "Subscriber"), because I would like to do the same thing with dozens of other classes. If I have to write some custom code, I will need to do it generically so it will work with any class.
Update #3: This is to explain more of why I think this is a needed tool. I'm writing a huge app, probably on Google App Engine (GAE). We are leaning toward a REST architecture for several reasons, but one is that our web GUI should access the datastore via a REST web layer. (I'm a lot more used to SOAP, so switching to REST is a small challenge in itself.) One of the classic ways of getting and updating data is through a business or data tier, and by using the REST utility mentioned above, I have the choice of XML or JSON. (I'm hoping to do a small working prototype of both before we develop the huge app.) Then, suppose we have a successful app and GAE doubles its prices: we can then rewrite just the data tier, and take our Python/Django user tier (web code) and run it on Amazon or somewhere else.
If I'm going to do all that, why would I want everything to be dictionary objects? Wouldn't I want the power of a full-blown class structure? One of the next tricks is sort of an object-relational mapping (ORM), so that we don't necessarily expose our exact data tables but more of a logical layer.
We also want to expose a RESTful API to paying users, who might be using any language. For them, they can use XML or JSON, and they wouldn't use the serialize routine discussed here.
json only encodes strings, floats, integers, JavaScript objects (Python dicts) and lists.
You have to create a function to turn the returned dictionary into a class instance, and then pass it to json.loads using the object_hook keyword argument along with the JSON string. Here's some code that fleshes it out:
import json

class Subscriber(object):
    firstname = None
    lastname = None

class Post(object):
    author = None
    title = None

def decode_from_dict(cls, vals):
    obj = cls()
    for key, val in vals.items():
        setattr(obj, key, val)
    return obj

SERIALIZABLE_CLASSES = {'Subscriber': Subscriber,
                        'Post': Post}

def decode_object(d):
    for field in d:
        if field in SERIALIZABLE_CLASSES:
            cls = SERIALIZABLE_CLASSES[field]
            return decode_from_dict(cls, d[field])
    return d

results = '''[{"Subscriber": {"firstname": "Neal", "lastname": "Walters"}},
              {"Post": {"author": {"Subscriber": {"firstname": "Neal",
                                                  "lastname": "Walters"}}},
               "title": "Decoding JSON Objects"}]'''

result = json.loads(results, object_hook=decode_object)
print result
print result[1].author
This will handle any class that can be instantiated without arguments to the constructor and for which setattr will work.
Also, this uses json. I have no experience with simplejson, so YMMV, but I hear that they are identical.
Note that although the values for the two subscriber objects are identical, the resulting objects are not. This could be fixed by memoizing decode_from_dict.
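A possible memoization sketch, under the assumption that all field values are hashable (which holds here, where values are strings or already-decoded objects); the cache name is illustrative:

_decoded_cache = {}

def decode_from_dict_memo(cls, vals):
    # share one instance per (class, field values) combination
    key = (cls, tuple(sorted(vals.items())))
    if key not in _decoded_cache:
        _decoded_cache[key] = decode_from_dict(cls, vals)
    return _decoded_cache[key]

Because object_hook decodes the innermost dicts first, identical nested Subscribers would collapse to the same instance before the enclosing Post is built.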
results in your snippet is a dict, not a string, so the json.loads would raise an exception. If that were fixed, each item yielded by inspect.getmembers is a (name, value) tuple, so trying to concatenate its non-string members to a string as the inner loop does would raise another exception. I guess you've simplified your code, but the two type errors should already show that you simplified it too much (and incorrectly). Why not use an (equally simplified) working snippet, and the actual string you want to json.loads, instead of one that can't possibly reproduce your problem? That course of action would make it much easier to help you.
Beyond peering at the actual string, and showing some obvious information such as type(subscriber), it's hard to offer much more help based on that clearly-broken code and such insufficient information :-(.
Edit: in "update2", the OP says
It displays this: Subscriber: {u'lastname': u'Walters', u'firstname': u'Neal'}
...and what else could it possibly display, pray?! You're printing the key as string, then the value as string -- the key is a string, and the value is another dict, so of course it's "stringified" (and all strings in JSON are Unicode -- just like in C# or Java, and you say you come from a MSFT background, so why does this surprise you at all?!). str(somedict), identically to repr(somedict), shows the repr of keys and values (with braces around it all and colons and commas as appropriate separators).
JSON, a completely language-independent serialization format though originally centered on Javascript, has absolutely no idea of what classes (if any) you expect to see instances of (of course it doesn't, and it's just absurd to think it possibly could: how could it possibly be language-independent if it hard-coded the very concept of "class", a concept which so many languages, including Javascript, don't even have?!) -- so it uses (in Python terms) strings, numbers, lists, and dicts (four very basic data types that any semi-decent modern language can be expected to have, at least in some library if not embedded in the language proper!). When you json.loads a string, you'll always get some nested combination of the four datatypes above (all strings will be unicode, and numbers will be ints or floats as appropriate, BTW;-).
If you have no idea (and don't want to encode by some arbitrary convention or other) what class's instances are being serialized, but absolutely must have class instances back (not just dicts etc) when you deserialize, JSON per se can't help you -- that metainformation cannot possibly be present in the JSON-serialized string itself.
If you're OK with the four fundamental types, and just want to see some printed results that you consider "prettier" than the default Python string printing of the fundamental types in question, you'll have to code your own recursive pretty-printing function depending on your subjective definition of "pretty" (I doubt you'd like Python's own pprint standard library module any more than you like your current results;-).
My guess is that loads is returning a dictionary. To iterate over its content, use something like:
for key, value in subscriber.items():
    self.response.out.write("%s: %s" % (key, value))