How to get highlighted searches on whoosh - python

I used some example code from pythonhosted.org, but nothing seems to happen. This is the code I used:
results = mysearcher.search(myquery)
for hit in results:
    print(hit["title"])
I entered this code into Python, but it gives an error saying that mysearcher is not defined. I'm not sure whether I'm missing something, as I'm just trying to learn the basics to get up and running.

The snippet never defines mysearcher (or myquery); it is only a fragment of a larger example. You need to create an index, open a searcher, and build a query first. Here is a complete example:
>>> import whoosh
>>> from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
>>> ix = create_in("indexdir", schema)
>>> writer = ix.writer()
>>> writer.add_document(title=u"First document", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Second document", path=u"/b",
...                     content=u"The second one is even more interesting!")
>>> writer.commit()
>>> from whoosh.qparser import QueryParser
>>> with ix.searcher() as searcher:
...     query = QueryParser("content", ix.schema).parse("first")
...     results = searcher.search(query)
...     results[0]
...
{"title": u"First document", "path": u"/a"}
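Then you can highlight like this (inside the with block, while the searcher is still open):
for hit in results:
    print(hit["title"])
    # Assumes the "content" field is stored, i.e. content=TEXT(stored=True) in the schema
    print(hit.highlights("content"))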

Why does lxml.etree.SubElement(body, "br") create <br/>?

I'm going through the lxml tutorial and I have a question:
Here is the code:
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
#############LOOK!!!!!!!############
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
#############END####################
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
As you can see, in the marked block, the statement br = etree.SubElement(body, "br") creates only a single <br/> tag. Why is that?
Is br a reserved word?
Following someone's kind suggestion, I am posting my answer here:
Look at this code first:
from lxml import etree
if __name__ == '__main__':
    print("""Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>""")
    html_node = etree.Element("html")
    body_node = etree.SubElement(html_node, "body")
    body_node.text = "Hello"
    # tostring() returns bytes on Python 3, so decode before concatenating
    print("Step1:" + etree.tostring(html_node).decode())
    br_node = etree.SubElement(body_node, "br")
    print("Step2:" + etree.tostring(html_node).decode())
    br_node.tail = "World"
    print("Step3:" + etree.tostring(html_node).decode())
    br_node.text = "Yeah?"
    print("Step4:" + etree.tostring(html_node).decode())
Here is the output:
Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>
Step1:<html><body>Hello</body></html>
Step2:<html><body>Hello<br/></body></html>
Step3:<html><body>Hello<br/>World</body></html>
Step4:<html><body>Hello<br>Yeah?</br>World</body></html>
At first, what I was trying to figure out was:
Why is the output of br_node <br/> rather than <br></br>?
Compare Step3 and Step4, and the answer is quite clear:
If an element has no content (no text and no children), it is serialized in the short form <name/>.
Because <br/> already has a meaning in HTML, this simple question confused me for a long time.
I hope this post helps others like me.
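The same rule can be checked without lxml. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made for illustration; it is not the library used above, and it writes a space before the slash in the short form), so the exact bytes differ slightly from the lxml output.

```python
import xml.etree.ElementTree as ET


def render(element):
    """Serialize an element tree to a str for easy comparison."""
    return ET.tostring(element, encoding="unicode")


html = ET.Element("html")
body = ET.SubElement(html, "body")
body.text = "Hello"
br = ET.SubElement(body, "br")

# No text and no children: the serializer uses the short empty-element form.
empty_form = render(html)

# The tail is written after the element, so <br> itself is still empty.
br.tail = "World"
tail_form = render(html)

# Once the element has text, it needs separate start and end tags.
br.text = "Yeah?"
text_form = render(html)

print(empty_form)
print(tail_form)
print(text_form)
```

The element name plays no role here: any element without text or children is written in the short form, so br is not treated specially.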

dynamodb row count via python, boto query

Folks,
I am trying to get the following bit of code working to return the row count in a table:
import boto
import boto.dynamodb2
from boto.dynamodb2.table import Table
from boto.dynamodb2.fields import HashKey, RangeKey
drivers = Table('current_fhv_drivers')
rowcountquery = drivers.query(
    number = 'blah',
    expiration = 'foo',
    count=True,
)
for x in rowcountquery:
    print x['Count']
The error I see is:
boto.dynamodb2.exceptions.UnknownFilterTypeError: Operator 'count' from 'count' is not recognized.
What's the correct syntax to get the row count? :)
Thanks!
That exception is basically telling you that the operator 'count' is not recognized by boto.
If you read the second paragraph on http://boto.readthedocs.org/en/latest/dynamodb2_tut.html#querying you'll see that:
Filter parameters are passed as kwargs & use a __ to separate the fieldname from the operator being used to filter the value.
So I would change your code to:
import boto
import boto.dynamodb2
from boto.dynamodb2.table import Table
from boto.dynamodb2.fields import HashKey, RangeKey
drivers = Table('current_fhv_drivers')
rowcountquery = drivers.query(
    number__eq = 'blah',
    expiration__eq = 'foo',
    count__eq = True,
)
for x in rowcountquery:
    print x['Count']
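The `__` convention quoted from the tutorial also explains the wording of the error message. The sketch below is a toy parser written for illustration only; it is not boto's code, and the function name split_filter and the operator set are assumptions. It shows how a keyword without `__` ends up being read entirely as an operator.

```python
# Operators accepted by this toy parser (a hypothetical subset).
KNOWN_OPERATORS = {"eq", "lt", "lte", "gt", "gte", "between", "beginswith"}


def split_filter(kwarg_name):
    """Split 'field__op' into (field, op); the text after the last '__' is the operator."""
    field, _, operator = kwarg_name.rpartition("__")
    if operator not in KNOWN_OPERATORS:
        raise ValueError(
            "Operator %r from %r is not recognized." % (operator, kwarg_name)
        )
    return field, operator


parsed = split_filter("number__eq")
print(parsed)

try:
    split_filter("count")
except ValueError as error:
    message = str(error)
    print(message)
```

With no `__` in the name, the whole keyword "count" lands in the operator position, which matches the shape of the exception in the question.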
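Alternatively, if all you need is the number of matching items, Table.query_count returns it directly:
from boto.dynamodb2.table import Table
table = Table('current_fhv_drivers')
print(table.query_count(number__eq = 'blah', expiration__eq = 'foo'))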

serializing sqlalchemy class to json

I'm trying to serialize the result (a list) of a SQLAlchemy query to JSON.
This is the class:
class Wikilink(Base):
    __tablename__='Wikilinks'
    __table_args__={'extend_existing':True}
    id = Column(Integer,autoincrement=True,primary_key=True)
    title = Column(Unicode(350))
    user_ip = Column(String(50))
    page = Column(String(20))
    revision = Column(String(20))
    timestamp = Column(String(50))
I guess my problem is with the __repr__(self) method.
I tried something like:
return '{{0}:{"title":{1}, "Ip":{2}, "page":{3} ,"revision":{4}}}'.format(self.id,self.title.encode('utf-8'),self.user_ip,self.page,self.revision)
or:
return '{"id"={0}, "title"={1}, "Ip"={2}}'.format(self.id,self.title.encode('utf-8'),self.user_ip.encode('utf-8'),self.page,self.revision)
and I got:
TypeError(repr(o) + " is not JSON serializable")
ValueError: Single '}' encountered in format string
I tried:
return '{id=%d, title=%s, Ip=%s}'%(self.id,self.title.encode('utf-8'),self.user_ip.encode('utf-8'))
and I got:
TypeError: {id=8126, title=1 בדצמבר, Ip=147.237.70.106} is not JSON serializable
Adding "" around the keys and values (according to the JSON formatting), like this: "id"="%d", "title"="%s", "Ip"="%s", didn't help either.
I know this is supposed to be dead simple, but I just can't get it right.
Actually, Bottle handles the jsonification part automatically, but calling json.dumps on the result gives me the same errors.
Instead of trying to convert a string to JSON, you could define your own to_dict method, for example, that returns the dictionary structure you seem to be trying to create, and then generate the JSON from that structure:
>>> import json
>>> d = {'id':8126, 'title':u'1 בדצמבר', 'ip':'147.237.70.106'}
>>> json.dumps(d)
'{"ip": "147.237.70.106", "id": 8126, "title": "1 \\u05d1\\u05d3\\u05e6\\u05de\\u05d1\\u05e8"}'
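A to_dict method along those lines might look like the following minimal sketch. It is written for Python 3 and uses a plain class instead of the mapped SQLAlchemy class so that it runs on its own; the class name WikilinkRecord, the json_fields tuple, and the page and revision values are assumptions made for illustration.

```python
import json


class WikilinkRecord:
    """Plain stand-in for the mapped class; only the serialization part is shown."""

    # Attribute names to expose in the JSON output (hypothetical selection).
    json_fields = ("id", "title", "user_ip", "page", "revision")

    def __init__(self, id, title, user_ip, page, revision):
        self.id = id
        self.title = title
        self.user_ip = user_ip
        self.page = page
        self.revision = revision

    def to_dict(self):
        """Return a dict of plain values that json.dumps can handle."""
        return {name: getattr(self, name) for name in self.json_fields}


# Placeholder page/revision values; only id, title and IP come from the question.
records = [WikilinkRecord(8126, "1 בדצמבר", "147.237.70.106", "page-1", "rev-1")]

# Serialize the list of dicts, not the objects or their repr() strings.
payload = json.dumps([record.to_dict() for record in records], ensure_ascii=False)
print(payload)
```

The point of the design is that __repr__ stays a debugging aid, while to_dict produces the structure that the JSON encoder (or a framework such as Bottle) actually serializes.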
I'm not sure I understand what you tried. Couldn't you build the dict and let json.dumps() do the work for you?
Something like:
>>> class Foo:
...     id = 1
...     title = 'my title'
...     to_jsonize = ['id', 'title']
...
>>> dct = {name: getattr(Foo,name) for name in Foo.to_jsonize}
>>> import json
>>> json.dumps(dct)
'{"id": 1, "title": "my title"}'

Access untranslated content of Django's ugettext_lazy

I'm looking for a sane way to get at the untranslated content of a string wrapped in ugettext_lazy. I found two ways, but I'm not happy with either one:
the_string = ugettext_lazy('the content')
the_content = the_string._proxy____args[0] # ewww!
or
from django.utils.translation import activate, get_language, ugettext_lazy
from django.utils.encoding import force_unicode
the_string = ugettext_lazy('the content')
current_lang = get_language()
activate('en')
the_content = force_unicode(the_string)
activate(current_lang)
The first piece of code accesses an attribute that has been explicitly marked as private, so there is no telling how long this code will work. The second solution is overly verbose and slow.
Of course, in the actual code, the definition of the ugettext_lazy-wrapped string and the code that accesses it are miles apart.
Here is a better version of your second solution:
from django.utils import translation
from django.utils.translation import ugettext_lazy
the_string = ugettext_lazy('the content')
with translation.override('en'):
    content = unicode(the_string)
Here are two more options. They are not very elegant, but they avoid the private API and are not slow.
Number one: define your own ugettext_lazy:
from django.utils import translation
def ugettext_lazy(message):
    t = translation.ugettext_lazy(message)
    t.message = message  # keep the untranslated original on the proxy
    return t
>>> from django.utils.translation import activate
>>> text = ugettext_lazy('Yes')
>>> text.message
"Yes"
>>> activate('lt')
>>> unicode(text)
u"Taip"
>>> activate('en')
>>> unicode(text)
u"Yes"
Number two: redesign your code. Define untranslated messages separately from where you use them:
gettext = lambda s: s
some_text = gettext('Some text')
lazy_translated = ugettext_lazy(some_text)
untranslated = some_text
You can do that (but you shouldn't):
the_string = ugettext_lazy('the content')
the_string._proxy____args[0]
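The idea behind the non-private options, keeping the original message next to the lazily translated value, can be shown without Django. The sketch below is a toy model only: LazyMessage, CATALOGS, and translate are hypothetical names, and none of this is Django's API.

```python
class LazyMessage:
    """Hypothetical lazy string: translates only when converted to str."""

    def __init__(self, message, translate):
        self.message = message        # untranslated original, always available
        self._translate = translate   # callable invoked at render time

    def __str__(self):
        return self._translate(self.message)


# Toy catalog and "active language" switch, invented for this example.
CATALOGS = {"lt": {"Yes": "Taip"}}
active_language = "en"


def translate(message):
    """Look the message up in the catalog of the currently active language."""
    return CATALOGS.get(active_language, {}).get(message, message)


text = LazyMessage("Yes", translate)
active_language = "lt"

translated = str(text)    # rendered in the active language
original = text.message   # the untranslated content, no language switch needed
print(translated, original)
```

Reading the original this way needs neither a private attribute nor a temporary language switch, which is the property the question asks for.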

How do I validate an XML document using a compact RELAX NG schema in Python?

How do I validate an XML document against a RELAX NG schema in compact syntax in Python?
How about using lxml?
From the docs:
>>> from io import StringIO
>>> from lxml import etree
>>> f = StringIO('''\
... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
...  <zeroOrMore>
...     <element name="b">
...       <text />
...     </element>
...  </zeroOrMore>
... </element>
... ''')
>>> relaxng_doc = etree.parse(f)
>>> relaxng = etree.RelaxNG(relaxng_doc)
>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> relaxng.validate(doc)
True
>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> relaxng.validate(doc2)
False
If you want to validate against a schema in RELAX NG compact syntax from the command line, you can use pyjing, from the jingtrang module.
It supports .rnc files and displays more detail than just True or False. For example:
C:\>pyjing -c root.rnc invalid.xml
C:\invalid.xml:9:9: error: element "name" not allowed here; expected the element end-tag or element "bounds"
NOTE: it is a Python wrapper around the Java Jing and Trang tools, so Java must be installed.
If you want to validate from within Python, you can:
1. Use pytrang (from the jingtrang wrapper) to convert the compact RELAX NG schema (.rnc) to the XML syntax (.rng):
pytrang root.rnc root.rng
2. Use lxml to parse the converted .rng file, as described at https://lxml.de/validation.html#relaxng
That would look something like this:
>>> from io import StringIO
>>> from lxml import etree
>>> from subprocess import call
>>> call(["pytrang", "root.rnc", "root.rng"])  # an argument list works without shell=True
>>> with open("root.rng", "rb") as f:
...     relaxng_doc = etree.parse(f)
...
>>> relaxng = etree.RelaxNG(relaxng_doc)
>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> relaxng.validate(doc)
True
>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> relaxng.validate(doc2)
False
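When the conversion step is run from a script rather than the interactive prompt, it helps to fail loudly if the converter exits with an error. The following is a minimal sketch, assuming pytrang is on the PATH; the helper names trang_command and run_checked are made up for this example, and the demonstration call uses the Python interpreter itself so that the snippet runs even where pytrang is not installed.

```python
import subprocess
import sys


def trang_command(rnc_path, rng_path, tool="pytrang"):
    """Build the argument list; a list avoids shell quoting problems."""
    return [tool, rnc_path, rng_path]


def run_checked(argv):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


command = trang_command("root.rnc", "root.rng")
print(command)

# Stand-in command: the interpreter prints a marker instead of converting a schema.
output = run_checked([sys.executable, "-c", "print('converted')"])
print(output.strip())
```

In real use, run_checked(trang_command("root.rnc", "root.rng")) would be called before parsing root.rng with lxml, so that a failed conversion is not silently followed by validation against a stale or missing file.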
