Creating database schema for parsed feed - python

Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the right approach be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table db design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed urls and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success in getting a "tinyurl" variable to get into the sqlite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg

I would suggest reading up on database normalisation, especially on the 1st and 2nd normal forms. Once you're done with that, I hope there won't be a need for default values, and your db schema will evolve into something more appropriate.
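For illustration, a normalized layout along those lines might look roughly like this (table and column names are only assumptions, not taken from your code): one table for entries, one for tags, and a link table between them, so entries without a url or without tags simply get NULL or no linked rows instead of default values.

    import sqlite3

    # Rough sketch of a normalized schema; names are illustrative only.
    conn = sqlite3.connect("feeds.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS entries (
        id      INTEGER PRIMARY KEY,
        feed_id INTEGER,
        title   TEXT,
        summary TEXT,
        url     TEXT,   -- simply NULL when an entry has no url
        date    TEXT
    );
    CREATE TABLE IF NOT EXISTS tags (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE
    );
    -- many-to-many link table: an entry can have zero or more tags
    CREATE TABLE IF NOT EXISTS entry_tags (
        entry_id INTEGER REFERENCES entries(id),
        tag_id   INTEGER REFERENCES tags(id),
        PRIMARY KEY (entry_id, tag_id)
    );
    """)
    conn.commit()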
There are plenty of options for sharing your source code on the web; depending on what versioning system you're most comfortable with, you might have a look at such well-known sites as Google Code, Bitbucket, GitHub and many others.

Related

The best way to work with larger data queries?

After one month of learning and toying around with Python and doing tons and tons of exercises and samples, I am still not really able to answer one important question for myself. If I generate data, for example by loading it from XML or by scraping, how do I work with it? Right now I put each and every entry directly into a sqlite db.
For example:
I read a news feed and put id, title, description, link and tags into one row in my sqlite db. I do some categorizing according to keywords within the title or description. Each news entry goes into one row. After that I can read them row by row or all at once.
I can filter them, sort them... But somehow I feel like there is a better way: put them all, as dictionaries, into one big list and work with that list, and only after I have worked through them, sorted them, or gathered even more information from the data, put them into the sqlite db. But none of my books or tutorials talk about this topic! To give an example: is a news article more important if a certain keyword comes up multiple times, or together with another keyword? Compare the news from that page with the news from another page with the same keywords.
I guess you will be laughing and say: he is talking about ..... how can he not know that. But I am new to all of this. Sorry.
Thank you for your help guiding me in the right direction.
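For what it's worth, here is a minimal sketch of the in-memory approach described above (parse into a list of dictionaries, process, then write the results to sqlite in one batch); the field names and the scoring rule are made up for illustration:

    import sqlite3

    # Hypothetical parsed entries, e.g. from a feed or a scraper.
    entries = [
        {"id": 1, "title": "Python 3 news", "description": "...", "link": "...", "tags": "python"},
        {"id": 2, "title": "Other news",    "description": "...", "link": "...", "tags": ""},
    ]

    # Work with the list in memory first: score each entry by a keyword, filter, sort.
    keyword = "python"
    for entry in entries:
        text = (entry["title"] + " " + entry["description"]).lower()
        entry["score"] = text.count(keyword)

    important = sorted((e for e in entries if e["score"] > 0),
                       key=lambda e: e["score"], reverse=True)

    # Only then write the processed rows to the database in one go.
    conn = sqlite3.connect("news.db")
    conn.execute("CREATE TABLE IF NOT EXISTS news (id INTEGER, title TEXT, score INTEGER)")
    conn.executemany("INSERT INTO news VALUES (?, ?, ?)",
                     [(e["id"], e["title"], e["score"]) for e in important])
    conn.commit()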

Python: RE vs. Query

I am building a website using Django, and this website uses blocks which are enabled for a certain page.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and does re.search on the TextField.
However, I was wondering whether it would not be a better idea, in terms of overhead, to use a separate DB table for block/paths, where each row contains a single path and a reference to a block.
A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields get bigger, you'll start to notice performance issues and eventually crash the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
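A rough sketch of what that could look like as Django models (model and field names are assumptions, not taken from the question):

    from django.db import models

    class Block(models.Model):
        name = models.CharField(max_length=100)
        content = models.TextField()

    class BlockPath(models.Model):
        # one row per path on which the block is enabled
        block = models.ForeignKey(Block, on_delete=models.CASCADE, related_name="paths")
        path = models.CharField(max_length=255, db_index=True)

        class Meta:
            unique_together = ("block", "path")

    # The per-request lookup then becomes a single indexed query, e.g.:
    #     Block.objects.filter(paths__path=request.path)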
I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)

How to store wiki sites (vcs)

As a personal project I am trying to write a wiki with the help of Django. I'm a beginner when it comes to web development. I am at the (early) point where I need to decide how to store the wiki sites. I have three approaches in mind and would like to know your suggestions.
Flat files
I considered a flat file approach with a version control system like git or mercurial. Firstly, I would have some example wikis to look at like http://hatta.sheep.art.pl/. Secondly, the vcs would probably deal with editing conflicts and keeping the edit history, so I would not have to reinvent the wheel. And thirdly, I could probably easily clone the wiki repository, so I (or for that matter others) can have an offline copy of the wiki.
On the other hand, as far as I know, I cannot use Django models with flat files. Then, if I wanted to add fields to a wiki site, like a category, I would need to somehow keep a reference to that flat file in order to associate the fields in the database with the flat file. Besides, I don't know if it is a good idea to have all the wiki sites in one repository; I imagine it is more natural to have something like a repository per wiki site, i.e. per file. Last but not least, I'm not sure, but I think using flat files would limit my deployment options, because some web hosts may not allow creating files (I'm thinking, for example, of Google App Engine).
Storing in a database
By storing the wiki sites in the database I can utilize Django models and associate arbitrary fields with a wiki site. I would probably also have an easier life deploying the wiki. But I would not get vcs features like history and conflict resolution per se. I searched for django-extensions to help me and found django-reversion. However, I do not fully understand whether it fits my needs: does it track model changes, for example if I change the Django model file, or does it track the content of the models (which would fit my need)? Plus, I do not see whether django-reversion would help me with edit conflicts.
Storing a vcs repository in a database field
This would be my ideal solution. It would combine the advantages of both previous approaches without the disadvantages. That is, I would have vcs features but would save the wiki sites in a database. The problem is: I have no idea how feasible that is. I just imagine saving a wiki site/source together with a git/mercurial repository in a database field. Yet, I somehow doubt database fields work like that.
So, I'm open to any other approaches, but this is what I came up with. Also, if you're interested, you can find the crappy early test I'm working on here: http://github.com/eugenkiss/instantwiki-test
In none of your choices have you considered whether you wish to be able to search your wiki. If this is a consideration, having the 'live' copy of each page in a database with full text search would be hugely beneficial. For this reason, I would personally go with storing the pages in a database every time - otherwise you'll have to create your own index somewhere.
As far as version logging goes, you only need to store the live copy in an indexable format. You could automatically create a history item within your 'page' model when a changed page is written back to the database. You can cut down on the storage overhead of earlier page revisions by compressing the data, should this become necessary.
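As a sketch of that idea, something along these lines could work (model names and fields are illustrative, not a definitive design):

    from django.db import models

    class Page(models.Model):
        title = models.CharField(max_length=200, unique=True)
        content = models.TextField()          # the live, searchable copy

    class PageRevision(models.Model):
        # one history item per saved change
        page = models.ForeignKey(Page, on_delete=models.CASCADE, related_name="revisions")
        content = models.TextField()
        saved_at = models.DateTimeField(auto_now_add=True)

    def save_page(page, new_content):
        """Snapshot the old content before overwriting the live copy."""
        PageRevision.objects.create(page=page, content=page.content)
        page.content = new_content
        page.save()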
If you're expecting a massive amount of change logging, you might want to read this answer here:
How does one store history of edits effectively?
Creating a wiki is fun and rewarding, but there are a lot of prebuilt wiki software packages already. I suggest Wikipedia's List of wiki software. In particular, MoinMoin and Trac are good. Finally, John Sutherland has made a wiki using Django.

Efficient storage of and access to web pages with Python

So like many people I want a way to download web pages, index/extract information from them, and store them efficiently. My first thought is to use MySQL and simply shove the pages in, which would let me use FULLTEXT searches and do ad hoc queries easily (in case I want to see if something exists and extract it, etc.). But of course performance-wise I have some concerns, especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB, search engines, etc. To summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store meta data (URL, time retrieved, any GET/POST stuff I sent), response code, etc. of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example, a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible; I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss meta data into MySQL, store the data on the file system or something like CouchDB, write some custom search stuff.
Toss meta data into MySQL, store the data on a file system with a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html/etc. specified (a directory index for each directory, in other words), and use a search engine like Lucene or Sphinx to index the content and use that to search. The biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering please include links to any technologies you mention and if possible what programming languages it has libraries for (i.e. if it's Scala only or whatever it's probably not that useful since this is a Python project). If this question has already been asked (I'm sure it must have been) please let me know (I searched, no luck).
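To make options 2/3 above concrete, here is a rough sketch of that split (metadata rows in a small database, raw response bodies on disk under a dated path); all names and the path scheme are illustrative assumptions:

    import os
    import sqlite3
    import time
    from urllib.parse import quote

    conn = sqlite3.connect("crawl.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS pages (
        url TEXT, fetched_at TEXT, status INTEGER, body_path TEXT)""")

    def store_page(url, status, body_bytes):
        # Raw body goes to the filesystem under a pages/YYYY/MM/DD/HH/MM/SS/ path...
        stamp = time.gmtime()
        directory = time.strftime("pages/%Y/%m/%d/%H/%M/%S", stamp)
        os.makedirs(directory, exist_ok=True)
        body_path = os.path.join(directory, quote(url, safe=""))
        with open(body_path, "wb") as fh:
            fh.write(body_bytes)
        # ...while the queryable metadata goes into the database.
        conn.execute("INSERT INTO pages VALUES (?, ?, ?, ?)",
                     (url, time.strftime("%Y-%m-%dT%H:%M:%SZ", stamp), status, body_path))
        conn.commit()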
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine that MySQL wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming &c). Your metadata all comes from headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- html will come to you in all sorts of weird encodings, but there's no need to get crazy supporting that at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
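For example, something like this (one possible approach, using requests' encoding detection; other detectors such as chardet would work just as well):

    import requests

    def fetch_as_utf8(url):
        """Fetch a page and return its body transcoded to UTF-8 before storing/indexing."""
        resp = requests.get(url)
        encoding = resp.encoding or resp.apparent_encoding or "utf-8"
        text = resp.content.decode(encoding, errors="replace")
        return text.encode("utf-8")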
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
You may be trying to achieve too much with the storage of the html (and supporting files). It seems you wish this repository would both
allow you to display a particular page as it was on its original site
provide indexing for locating pages relevant to particular search criteria
The html underlying a web page once looked a bit akin to a self-standing document, but the pages crawled off the net nowadays are much messier: javascript, ajax snippets, advertisement sections, image blocks etc.
This reality may cause you to rethink the one storage for all html approach. (And also the parsing / pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. By "true text content", I mean the [possibly partially marked-up] text from the web pages that is otherwise free of all other "Web 2.0 noise". Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres in the form of semi-structured data. For operational purposes (e.g. to task the crawlers etc.), you may keep a relational store with management-related metadata, but the idea is that, for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
It sounds to me like you need a content management system. Check out Plone. If that's not what you want maybe a web framework, like Grok, BFG, Django, Turbogears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)

How to implement Google Suggest in your own web application (e.g. using Python)

In my website, users have the possibility to store links.
While the internet address is being typed into the designated field, I would like to display a suggest/autocomplete box similar to Google Suggest or the Chrome Omnibar.
Example:
The user is typing this URL:
http://www.sta
Suggestions which would be displayed:
http://www.staples.com
http://www.starbucks.com
http://www.stackoverflow.com
How can I achieve this while not reinventing the wheel? :)
You could try with
http://google.com/complete/search?output=toolbar&q=keyword
and then parse the xml result.
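A minimal sketch of that approach (the endpoint comes from the answer above and may have changed or been retired; the code just collects anything carrying a "data" attribute rather than relying on exact element names):

    import urllib.request
    import xml.etree.ElementTree as ET
    from urllib.parse import quote

    def google_suggest(keyword):
        url = "http://google.com/complete/search?output=toolbar&q=" + quote(keyword)
        with urllib.request.urlopen(url) as resp:
            tree = ET.parse(resp)
        # Suggestions are carried in "data" attributes in the toolbar XML.
        return [el.get("data") for el in tree.iter() if el.get("data")]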
I did this once before on a Django server. There are two parts - client-side and server-side.
Client side you will have to send out XmlHttpRequests to the server as the user is typing, and then when the information comes back, display it. This part will require a decent amount of javascript, including some tricky parts like callbacks and keypress handlers.
Server side, you will have to handle the XmlHttpRequests, which will contain whatever the user has typed so far, e.g. a URL like
www.yoursite.com/suggest?typed=www.sta
and then respond with the suggestions encoded in some way. (I'd recommend JSON-encoding the suggestions.) You also have to actually get the suggestions from your database; this could be just a simple SQL call or something else, depending on your framework.
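For the server side, a minimal sketch in a recent Django version might look like this (the Link model and field names are assumptions):

    from django.http import JsonResponse

    from myapp.models import Link   # hypothetical model holding the stored links

    def suggest(request):
        """Handle /suggest?typed=www.sta and return JSON-encoded suggestions."""
        typed = request.GET.get("typed", "")
        urls = []
        if typed:
            urls = list(Link.objects
                            .filter(url__istartswith=typed)
                            .values_list("url", flat=True)[:10])
        return JsonResponse({"suggestions": urls})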
But the server-side part is pretty simple. The client-side part is trickier, I think. I found this article helpful
He's writing things in PHP, but the client-side work is pretty much the same. In particular, you might find his CSS helpful.
Yahoo has a good autocomplete control.
They have a sample here.
Obviously this does nothing to help you out in getting the data - but it looks like you have your own source and aren't actually looking to get data from Google.
If you want the auto-complete to use data from your own database, you'll need to do the search yourself and update the suggestions using AJAX as users type. For the search part, you might want to look at Lucene.
That control is often called a word wheel. MSDN has a recent walkthrough on writing one with LINQ. There are two critical aspects: deferred execution and lazy evaluation. The article has source code too.
