How to store wiki sites (vcs) - python

As a personal project I am trying to write a wiki with the help of Django. I'm a beginner when it comes to web development. I am at the (early) point where I need to decide how to store the wiki sites. I have three approaches in mind and would like to hear your suggestions.
Flat files
I considered a flat-file approach with a version control system like git or mercurial. Firstly, I would have example wikis to look at, like http://hatta.sheep.art.pl/. Secondly, the vcs would probably deal with editing conflicts and keeping the edit history, so I would not have to reinvent the wheel. And thirdly, I could probably easily clone the wiki repository, so I (or, for that matter, others) could have an offline copy of the wiki.
On the other hand, as far as I know, I cannot use Django models with flat files. Then, if I wanted to add fields to a wiki site, like a category, I would need to somehow keep a reference to that flat file in order to associate the fields in the database with the flat file. Besides, I don't know if it is a good idea to have all the wiki sites in one repository; I imagine it would be more natural to have something like one repository per wiki site or file. Last but not least, I'm not sure, but I think using flat files would limit my deployment options, because some web hosts may not allow creating files (I'm thinking, for example, of Google App Engine).
Storing in a database
By storing the wiki sites in the database I can utilize Django models and associate arbitrary fields with a wiki site. I would probably also have an easier life deploying the wiki. But I would not get vcs features like history and conflict resolution per se. I searched for Django extensions to help me and found django-reversion. However, I do not fully understand whether it fits my needs. Does it track model changes, as in changes to the Django model file, or does it track the content of the model instances (which is what I need)? Plus, I cannot tell whether django-reversion would help me with edit conflicts.
Storing a vcs repository in a database field
This would be my ideal solution. It would combine the advantages of both previous approaches without the disadvantages. That is, I would have vcs features, but I would save the wiki sites in a database. The problem is: I have no idea how feasible that is. I just imagine saving a wiki site/source together with a git/mercurial repository in a database field. Yet, I somehow doubt database fields work like that.
So, I'm open to any other approaches, but this is what I came up with. Also, if you're interested, you can find the crappy early test I'm working on here: http://github.com/eugenkiss/instantwiki-test

In none of your choices have you considered whether you wish to be able to search your wiki. If this is a consideration, having the 'live' copy of each page in a database with full text search would be hugely beneficial. For this reason, I would personally go with storing the pages in a database every time - otherwise you'll have to create your own index somewhere.
As far as version logging goes, you only need to store the live copy in an indexable format. You could automatically create a history item within your 'page' model when a changed page is written back to the database. You can cut down on the storage overhead of earlier page revisions by compressing the data, should this become necessary.
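Here is a minimal sketch of what that could look like; the model and field names (Page, PageRevision, and so on) are just placeholders for whatever your app already uses:

    from django.db import models

    class Page(models.Model):
        title = models.CharField(max_length=200, unique=True)
        content = models.TextField()  # the live, searchable copy

    class PageRevision(models.Model):
        # one row per saved edit; the newest revision mirrors Page.content
        page = models.ForeignKey(Page, related_name='revisions',
                                 on_delete=models.CASCADE)
        content = models.TextField()
        comment = models.CharField(max_length=200, blank=True)
        created = models.DateTimeField(auto_now_add=True)

        class Meta:
            ordering = ['-created']

    def save_edit(page, new_content, comment=''):
        # write the live copy and log the edit as a revision in one go
        page.content = new_content
        page.save()
        PageRevision.objects.create(page=page, content=new_content,
                                    comment=comment)

A diff-based or compressed storage of old revisions can be layered on later without changing this interface.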
If you're expecting a massive amount of change logging, you might want to read this answer here:
How does one store history of edits effectively?

Creating a wiki is fun and rewarding, but there are a lot of prebuilt wiki software packages already. I suggest Wikipedia's List of wiki software. In particular, MoinMoin and Trac are good. Finally, John Sutherland has made a wiki using Django.

Related

Changing existing pages in Agilo/Trac via plugin

Is it possible to make changes to an existing page in Trac via a plugin?
(I am not talking about the wiki, but the ticket system).
I am trying to write a plugin that uses the View Tickets -> Custom Query view and reads the tickets from the resulting table. The goal is to modify these tickets via a predefined Python script and then optionally print them.
Is this possible via the Trac API, or would one have to make a whole new page and write the whole query functionality from scratch to get the tickets from the database?
I feel that this is not very clearly documented by Trac, so I hope there are some people here with experience in plugin development for Trac and/or Agilo for Trac.
The first thing that comes to mind is that ITicketManipulator is not called in batch modify events. You might be able to implement a solution using IRequestFilter. I'd need more information about how you plan to modify the tickets in order to give better advice.
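As a rough sketch of the IRequestFilter route (the template name and the 'tickets' key are assumptions about what the query module hands to the template, so verify them against your Trac/Agilo version):

    from trac.core import Component, implements
    from trac.web.api import IRequestFilter

    class QueryTicketTweaker(Component):
        """Post-process the ticket list produced by the custom query view."""
        implements(IRequestFilter)

        def pre_process_request(self, req, handler):
            # leave request dispatching untouched
            return handler

        def post_process_request(self, req, template, data, content_type):
            # 'query.html' and data['tickets'] are assumed names -- check them
            if template == 'query.html' and data:
                for ticket in data.get('tickets', []):
                    pass  # run your predefined script against each ticket here
            return template, data, content_type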

How can I update a plone page via a script?

I have a large amount of automatically generated HTML files that I would like to push to my Plone website with a script. I currently generate the files, log into Plone, click edit on each individual page, and copy and paste the HTML into the editor. I'd like to automate this. It would be nice to retain the Plone versioning, have an auto-generated comment for the edit, and have the change come from a specific user.
I've read about and tried WebDAV with little luck getting it to work consistently, and I know there is a way to connect to Plone via FTP, but I haven't tried it. I'm not sure if these are the methods I need.
My google searches aren't leading me to anything useful. Any ideas on where to start looking for a solution to this? Or any tips on implementing it?
You can script anything in Plone via the following methods:
Through-the-web via API calls (e.g. XML-RPC, wsapi, etc.)
The bin/instance run script provided by plone.recipe.zope2instance (See charm for an example of this).
You can also use a migration framework like:
collective.transmogrifier
which allows you to write migration code, and trigger it via GenericSetup or Browser view. Additionally, there are applications written on top of Transmogrifier aimed roughly at what you are describing, the most popular of which is:
funnelweb
I would recommend that you consider using or writing Transmogrifier blueprints to do your import, and executing the pipeline with a tool that makes that easy:
mr.migrator
You can find blueprints by searching PyPI for "transmogrify". One popular set of blueprints is:
quintagroup.transmogrifier
One of the main attractions to the Transmogrifier approach, aside from getting the job done, is the ability to share useful blueprints with others.
I think Transmogrifier is the best tool for this job, but it will definitely be a programming task no matter how you do it. It is used for many migration jobs, such as migrating from Drupal.
There's an add-on, wsapi4plone.core, that pumazi at WebLion started, which provides web services for portals that you can hook into. You can create, modify, and delete content via XML-RPC calls. The only caveat is that it doesn't yet work with Collections (criteria, specifically).
project: http://pypi.python.org/pypi/wsapi4plone.core
docs: http://packages.python.org/wsapi4plone.core/
You can also do it programmatically by hooking into the ZODB via Python (zopepy or some other method).
These should get you started:
http://plone.org/documentation/kb/manipulating-plone-objects-programmatically/reading-and-writing-field-values - this should give you an understanding of accessors and mutators (getters and setters); in your case you are more than likely going to be working with obj.Text (getter) and obj.setText (setter). See the sketch after these links.
https://weblion.psu.edu/trac/weblion/wiki/AutomatingObjectCreation - lots of examples (slightly outdated but still relevant)
http://plone.org/documentation/faq/upload-images-files
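To make the programmatic route concrete, here is a very rough sketch of a script that could be fed to bin/instance run. The site id, page id, and file path are placeholders, and depending on your setup you may also need to set up a security manager as the target user so the edit is attributed correctly:

    # update_page.py -- run with: bin/instance run update_page.py
    # 'app' (the Zope application root) is injected by the run command.
    import transaction
    from Testing.makerequest import makerequest

    app = makerequest(app)
    site = app['Plone']            # placeholder Plone site id
    page = site['my-page']         # placeholder page id

    html = open('generated/my-page.html').read()
    page.setText(html, mimetype='text/html')  # the mutator mentioned above
    page.reindexObject()
    transaction.commit()

Versioning and the edit comment are not handled here; for that you would probably need to involve CMFEditions (the portal_repository tool) yourself.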
Try to enable WebDAV or FTP in Plone; then you can access Plone via WebDAV or FTP clients and push the HTML files. Plone (Zope) will recognise the HTML files as Pages.
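If you go the FTP route, the upload itself can be scripted with nothing more than the standard library; the host, port, credentials, and paths below are placeholders (the FTP server has to be enabled in your Zope/Plone configuration first):

    import ftplib

    ftp = ftplib.FTP()
    ftp.connect('my-plone-host', 8021)   # placeholder host/port from zope.conf
    ftp.login('admin', 'secret')         # a user allowed to add/edit content
    ftp.cwd('/Plone/my-folder')          # path to the target folder

    with open('generated/my-page.html', 'rb') as f:
        # Zope creates or updates a content object named after the file
        ftp.storbinary('STOR my-page.html', f)
    ftp.quit()

Note that this path does not give you fine-grained control over version comments, so it only covers the "get the HTML in" part of the problem.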

Django/python and Apache Solr: pysolr or solrpy?

Brand new on this forum and this is my first post!
At work we're starting a project which uses Apache Solr, and I'm in charge of the frontend system (Django-based).
Our Solr database isn't related to any other DB engine or to any model class, so Haystack isn't a good fit for us (since it's strictly tied to models).
I was looking at http://code.google.com/p/pysolr/ and http://code.google.com/p/solrpy/
Basically, they're similar. I like solrpy more, since it uses POST requests and we can mask our users' queries, but this makes its paginator harder to use (I guess...).
On the other side, pysolr, thanks to the GET method, performs better (lower query times), but so far I couldn't execute a query without getting a bad request error.
Before choosing one, I wanted to ask the community for opinions. Users only need to do searches, our data is handled by a Java process, no other DB is used (except for storing user information), and we need to use all Solr features (faceting, highlighting, stop words, analyzers...).
Which would you choose? And why? Any good code examples you can point me at? I was looking through the Haystack source to see how they implemented everything...
Thanks all!
We have used 'solrpy', but encountered some problems with it.
Sunburnt is actually an interesting API:
https://github.com/tow/sunburnt/
Actively developed, and easy to use. Unfortunately it introduces some additional dependencies.
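For what it's worth, a minimal sunburnt sketch looks like this; the Solr URL and the field name used for faceting are placeholders:

    import sunburnt

    # placeholder URL -- point this at your Solr core
    si = sunburnt.SolrInterface("http://localhost:8983/solr/")

    # free-text query with faceting and pagination
    response = (si.query("django")
                  .facet_by("category")        # assumed facet field
                  .paginate(start=0, rows=10)
                  .execute())

    for doc in response:
        print(doc)

Highlighting and filter queries are exposed through the same chained-call style, which is the main reason it tends to read more nicely than hand-built pysolr/solrpy calls.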

Efficient storage of and access to web pages with Python

So, like many people, I want a way to download web pages, index/extract information from them, and store them efficiently. My first thought is to use MySQL and simply shove the pages in, which would let me use FULLTEXT searches and do ad hoc queries easily (in case I want to see if something exists and extract it, etc.). But of course, performance-wise I have some concerns, especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB, search engines, etc. To summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store metadata (URL, time retrieved, any GET/POST data I sent, response code, etc.) of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example, for a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible, I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss meta data into MySQL, store the data on the file system or something like CouchDB, write some custom search stuff.
Toss metadata into MySQL, store the data on a file system served by a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html etc. specified (in other words, directory-index each directory), use a search engine like Lucene or Sphinx to index the content, and use that to search. The biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering, please include links to any technologies you mention and, if possible, what programming languages they have libraries for (i.e. if it's Scala-only or whatever, it's probably not that useful, since this is a Python project). If this question has already been asked (I'm sure it must have been), please let me know (I searched, no luck).
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine MySQL wouldn't either) and just use Sphinx's superior indexing and full-text search (including stemming, etc.). Your metadata all comes from the headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body, at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- HTML will come to you in all sorts of weird encodings, but there's no need to go crazy supporting all of them at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
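A minimal sketch of that "transcode on arrival" step, assuming the chardet library is available for guessing unknown encodings:

    import chardet

    def to_utf8(raw_bytes, declared_encoding=None):
        """Return the page decoded and re-encoded as UTF-8."""
        # try the encoding declared in the headers/meta tag first, then UTF-8
        for encoding in (declared_encoding, 'utf-8'):
            if not encoding:
                continue
            try:
                return raw_bytes.decode(encoding).encode('utf-8')
            except (UnicodeDecodeError, LookupError):
                pass
        # fall back to a statistical guess, replacing anything undecodable
        guess = chardet.detect(raw_bytes)['encoding'] or 'latin-1'
        return raw_bytes.decode(guess, 'replace').encode('utf-8')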
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
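For reference, querying from Python with the sphinxapi client bundled with Sphinx looks roughly like this; the index name and the searchd port are assumptions to adapt to your setup:

    from sphinxapi import SphinxClient

    cl = SphinxClient()
    cl.SetServer('localhost', 9312)         # assumed searchd port

    result = cl.Query('fizzbuzz', 'pages')  # 'pages' is an assumed index name
    if result:
        for match in result['matches']:
            # match['id'] maps back to the document row in your database
            print(match['id'])
    else:
        print(cl.GetLastError())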
You may be trying to achieve too much with the storage of the HTML (and supporting files). It seems you wish this repository would both
allow displaying a particular page as it appeared on its original site, and
provide indexing for locating pages relevant to particular search criteria.
The HTML underlying a web page once looked a bit like a self-standing document, but the pages crawled off the net nowadays are much messier: JavaScript, AJAX snippets, advertisement sections, image blocks, etc.
This reality may cause you to rethink the one-storage-for-all-HTML approach. (And also the parsing/pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. By "true text content", I mean the (possibly partially marked-up) text from the web pages that is otherwise free of all the "Web 2.0 noise". Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres in the form of semi-structured data. For operational purposes (e.g. to task the crawlers), you may keep a relational store with management-related metadata, but the idea is that, for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
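To make the fielded-plus-free-text point concrete, here is a small sketch using pysolr; the core URL and field names are placeholders and would of course have to exist in your Solr schema:

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/')  # placeholder core URL

    # one document mixing management metadata with the extracted "true text"
    solr.add([{
        'id': 'page-0001',
        'url': 'http://example.com/some/page',
        'retrieved': '2010-06-01T12:00:00Z',
        'status_code': 200,
        'content': 'the cleaned-up text extracted from the page...',
    }])

    # an ad hoc query over both kinds of fields
    for doc in solr.search('content:fizzbuzz AND status_code:200'):
        print(doc['url'])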
It sounds to me like you need a content management system. Check out Plone. If that's not what you want, maybe a web framework like Grok, BFG, Django, TurboGears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)

Creating database schema for parsed feed

Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a URL, but a few don't; likewise, many entries don't have hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table DB design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed URLs and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success getting a "tinyurl" variable into the sqlite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially on first and second normal forms. Once you're done, I hope there won't be a need for default values, and your DB schema will evolve into something more appropriate.
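As a rough illustration of where that normalisation could lead for this feed data (the table and column names are only a guess at your current schema):

    import sqlite3

    conn = sqlite3.connect('feeds.db')
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS feeds (
        id   INTEGER PRIMARY KEY,
        url  TEXT NOT NULL UNIQUE
    );
    CREATE TABLE IF NOT EXISTS entries (
        id       INTEGER PRIMARY KEY,
        feed_id  INTEGER NOT NULL REFERENCES feeds(id),
        summary  TEXT,
        url      TEXT,       -- NULL when the entry has no link
        posted   TIMESTAMP
    );
    CREATE TABLE IF NOT EXISTS tags (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL UNIQUE
    );
    -- many-to-many: an entry can carry zero or more tags
    CREATE TABLE IF NOT EXISTS entry_tags (
        entry_id  INTEGER NOT NULL REFERENCES entries(id),
        tag_id    INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (entry_id, tag_id)
    );
    """)
    conn.commit()

With tags in their own table, entries without hashtags simply have no rows in entry_tags, so no default values are needed.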
There are plenty of options for sharing your source code on the web. Depending on what versioning system you're most comfortable with, you might have a look at such well-known sites as Google Code, Bitbucket, GitHub, and many others.
