Python: RE vs. Query

I am building a website using Django, and this website uses blocks that can be enabled for certain pages.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and runs re.search against each TextField.
However, I was wondering whether, in terms of overhead, it would be better to use a separate DB table for block paths, where each row contains a single path and a reference to a block.

A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields grow, you'll start to notice performance issues and may eventually overwhelm the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
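For illustration, here is a minimal sketch of the table-based approach as a Django model. The names (Block, BlockPath, blocks_for) are assumptions for the example, not from the question:

from django.db import models

class Block(models.Model):
    name = models.CharField(max_length=100)

class BlockPath(models.Model):
    block = models.ForeignKey(Block, on_delete=models.CASCADE)
    # VARCHAR-backed and indexed, so exact-path lookups are fast.
    path = models.CharField(max_length=255, db_index=True)

    class Meta:
        unique_together = ("block", "path")

def blocks_for(request_path):
    # One indexed query at request time, instead of fetching every
    # TEXT field and running re.search in Python.
    return Block.objects.filter(blockpath__path=request_path)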

I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)

Related

Flask website backend structure guidance assistance?

I have a basic personal project website that I am looking to learn some web dev fundamentals with, along with database (SQL) fundamentals (if SQL is even the right technology to use?).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page, and from there the user can select one of a few subpages. For the sake of understanding, each of these subpages represents a different surf break, and they each display relevant info about that particular break, i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how I would go about inserting it into a database for future use (historical graphs, trends). How would I ensure data is added to this database on a continuous basis (once per day)? How would I use data that was scraped at an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave height, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code.

I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
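As a concrete sketch of that two-table idea, here is one possible DDL, using SQLite purely for illustration (the answer suggests Postgres or MySQL); all table and column names are assumptions:

import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS surf_breaks (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS observations (
    id            INTEGER PRIMARY KEY,
    surf_break_id INTEGER NOT NULL REFERENCES surf_breaks(id),
    observed_at   TEXT NOT NULL,  -- ISO-8601 timestamp
    wave_height_m REAL,
    wind_kts      REAL,
    tide_m        REAL,
    UNIQUE (surf_break_id, observed_at)  -- also guards against duplicate imports
);
"""

conn = sqlite3.connect("surf.db")
conn.executescript(DDL)
conn.commit()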
Next, try to insert data into the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if it is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to import a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure it out from the current time.
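A minimal sketch of such an idempotent, incremental importer, assuming the schema sketched above; scrape_observations() is a hypothetical stand-in for your real scraping code:

import sqlite3
import sys
from datetime import date

def scrape_observations(day):
    """Placeholder: replace with your real scraping code.
    Should return dicts with break_id, at, wave, wind, tide."""
    return []

def import_day(conn, day):
    # The UNIQUE constraint on (surf_break_id, observed_at) makes
    # re-running this for the same day a no-op instead of a mess.
    for obs in scrape_observations(day):
        conn.execute(
            "INSERT OR IGNORE INTO observations "
            "(surf_break_id, observed_at, wave_height_m, wind_kts, tide_m) "
            "VALUES (?, ?, ?, ?, ?)",
            (obs["break_id"], obs["at"], obs["wave"], obs["wind"], obs["tide"]),
        )
    conn.commit()

if __name__ == "__main__":
    # The interval comes from the command line, defaulting to today.
    day = sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat()
    import_day(sqlite3.connect("surf.db"), day)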
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use cron, the standard Linux scheduler, to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs (basically a "hey, this happened" tick), see those metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases.) Don't get bogged down with this upfront, but keep it in mind. StatsD, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these into a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
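A sketch of what that module could look like, continuing the SQLite example above; the function names are assumptions:

# dal.py - minimal Data Access Layer
import sqlite3

def connect(path="surf.db"):
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows behave like dicts
    return conn

def list_breaks(conn):
    return conn.execute("SELECT id, name FROM surf_breaks").fetchall()

def observations_for(conn, break_id):
    return conn.execute(
        "SELECT observed_at, wave_height_m, wind_kts, tide_m "
        "FROM observations WHERE surf_break_id = ? "
        "ORDER BY observed_at",
        (break_id,),  # parameter binding - never string formatting
    ).fetchall()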
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
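Sketching those two pages with Flask and the hypothetical dal module above (template names are assumptions):

from flask import Flask, render_template
import dal

app = Flask(__name__)

@app.route("/")
def index():
    conn = dal.connect()
    breaks = dal.list_breaks(conn)  # SELECT * FROM surf_breaks, more or less
    return render_template("index.html", breaks=breaks)

@app.route("/breaks/<int:break_id>")
def break_page(break_id):
    conn = dal.connect()
    rows = dal.observations_for(conn, break_id)
    return render_template("break.html", observations=rows)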
For performance, try not to get in a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app causes its content to change. Simply invalidate the cache when you import new data.

When should I use objects.raw() in Django?

I am quite new to Python and I have seen that both
Entries.objects.raw("SELECT * FROM register_entries")
and
Entries.objects.filter()
do the same query.
In which cases is it better to use one or the other?
It depends on the database backend you are using. The first assumes that you have a SQL-based database engine, which is not always true. The second, by contrast, will work in any case (as long as the backend supports it). There was, for instance, an LDAP backend a few years ago designed that way, and LDAP queries do not use SQL at all.
In all cases, I advise you to use the second one. It is the better way to go if you want to write reusable, long-term code.
There are also other reasons to prefer the second one to the first:
it avoids possible SQL injections;
there is no need to know the database design (table names, field names);
generic code is better than specific code;
and moreover, it is shorter.
But you will sometimes have to use the first one for specific operations (calling backend-specific functions); avoid them as much as possible.
In a nutshell: use the second one!
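As a concrete illustration, here are the two styles side by side inside a view; the name field and the request lookup are assumptions for the example:

name = request.GET["name"]  # user-supplied value

# ORM: backend-agnostic, and values are escaped automatically.
entries = Entries.objects.filter(name=name)

# Raw SQL: SQL-only, with the table name hard-coded. If you must use
# it, pass parameters separately so the driver escapes them; never
# build the query with string formatting.
entries = Entries.objects.raw(
    "SELECT * FROM register_entries WHERE name = %s", [name]
)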
From the Django documentation:
When the model query APIs don’t go far enough, you can fall back to writing raw SQL
The Django QuerySet API offers many options to customize your queries, but in some cases you need very specific queries that the API can't express. Before you go for raw SQL, though, it is best to read the QuerySet API docs and learn everything the API can do.

Couchdb/Mongodb Application/Logic layer, like Oracle DB

At my work, we use Oracle for our database, which works great. I am not the main DB admin, but I do work with it. One thing I like is that the DB has a built-in logic layer using PL/SQL which can handle logic related to saving and retrieving the data. I really like this because it allows our MVC application (PHP/Zend Framework) to be lighter, and makes it easier to tie another platform into the data, such as desktop or mobile.
I have a personal project, though, where I want to use CouchDB or MongoDB, and I want to try to accomplish a similar goal: outside of the MVC/framework, I want to have an API layer that the main applications talk to, so they don't talk directly to the database. They specify the design document (CouchDB) or something similar for Mongo to get the results, and that API layer validates the incoming data and makes sure the data itself is saved and updated properly. For example, to save a new user, in the framework I only need to send a JSON object with the keys/values that need to be saved, and the API layer saves the data in the proper places.
This API would probably have a UI, but only for administrative purposes and to make my life easier. In general it will always reply with JSON strings, or pre-rendered/cached HTML in some cases, since each API layer would be specific to the application anyway.
I was wondering if anyone has done anything like this, or has any tips on methods I could use to accomplish it. I am currently looking to write my application in Python, and the front end will likely be something like AngularJS, although I am also looking at Node.js for the back end.
We do this exact thing at my current job. We have MongoDB on the back end, a RESTful API on top of it and then PHP/Zend on the front end.
Most of our data is read only, so we import that data into MongoDB and then the RESTful API (in Java) just serves it up.
Some things to think about with this approach:
Write generic sorting/paging logic in your API; you'll need this for lists of data. The caller can pass in things like http://yourapi.com/entity/1?pageSize=10&page=3 (see the sketch after this list).
Make sure to create appropriate indexes in Mongo to match what people will query on. Imagine you are storing users. Make an index in Mongo on the user id field, or just use the _id field that is already indexed in all your calls.
Make sure to include all relevant data in a given document. Mongo doesn't do joins like you're used to in Oracle. Just keep in mind modeling data is very different with a document database.
You seem to want to write a layer (the middle tier API) that is database agnostic. That's a good goal. Just be careful not to let Mongo specific terminology creep into your exposed API. Mongo has specific operators/concepts that you'll need to mask with more generic terms. For example, they have a $set operator. Don't expose that directly.
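A minimal sketch of the paging and indexing advice above, using Flask and PyMongo; the database, collection, and field names are assumptions, and nothing Mongo-specific leaks out of the JSON response:

from flask import Flask, request, jsonify
from pymongo import MongoClient, ASCENDING

app = Flask(__name__)
users = MongoClient()["mydb"]["users"]
users.create_index([("user_id", ASCENDING)])  # match what callers query on

@app.route("/users")
def list_users():
    page = max(int(request.args.get("page", 1)), 1)
    page_size = min(int(request.args.get("pageSize", 10)), 100)
    cursor = (users.find({}, {"_id": False})  # drop Mongo's ObjectId
                   .sort("user_id", ASCENDING)
                   .skip((page - 1) * page_size)
                   .limit(page_size))
    return jsonify(list(cursor))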
Finally, after a decent amount of experience with both CouchDB and Mongo, I'd definitely go with Mongo.

Creating database schema for parsed feed

Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table DB design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed urls and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success in getting a "tinyurl" variable into the SQLite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially on the 1st and 2nd normal forms. Once you're done with that, I hope there won't be any need for default values, and your DB schema will evolve into something more appropriate.
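To make that concrete, here is one shape the schema might take after normalization (all names are assumptions): tags move into their own tables, so no default values are needed and an entry can have zero or more tags, or no url:

import sqlite3

conn = sqlite3.connect("feeds.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    id      INTEGER PRIMARY KEY,
    summary TEXT NOT NULL,
    date    TEXT NOT NULL,
    url     TEXT              -- NULL when the entry has no url
);
CREATE TABLE IF NOT EXISTS tags (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS entry_tags (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    tag_id   INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (entry_id, tag_id)
);
""")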
There are plenty of options for sharing your source code on the web; depending on which version-control system you're most comfortable with, you might have a look at such well-known sites as Google Code, Bitbucket, GitHub, and many others.

A python web application framework for tight DB/GUI coupling?

I'm a firm believer in the heretical thought of tight coupling between the backend and frontend: I want existing, implied knowledge about a backend to be automatically made use of when generating user interfaces. E.g., if a VARCHAR column has a maximum width of 20 characters, the GUI should automatically constrain the user from typing more than 20 characters in a related form field.
And I have strong antipathy to ORMs which want to define my database tables, or are based on some hack where every table needs to have extra numeric ID columns because of the ORM.
I've looked a bit into Python database frameworks, and I think I can conclude that SQLAlchemy fits my mentality best.
Now, I need to find a web application framework which fits naturally with SQLAlchemy (or an equivalent) and perhaps even with my appetite for coupling. With "web application framework", I mean products/projects such as Pylons, Django, TurboGears, web2py, etc.
E.g., it should ideally be able to:
automatically select a suitable form widget for data entering a given column if told to do so; e.g., if the column has a foreign key to a column with 10 different values, the widget should display the 10 possible values as a dropdown
auto-generate javascript form validation code which gives the end-user quick error feedback if a string is entered into a field which is about to end up in an INTEGER column, etc
auto-generate a calendar widget for data which will end up in a DATE column
hint NOT NULL constraints as javascript which complains about empty or whitespace-only data in a related input field
generate javascript validation code which matches relevant (simple) CHECK-constraints
make it easy to avoid SQL injection, by using prepared statements and/or validation of externally derived data
make it easy to avoid cross-site scripting by automatically escaping outgoing strings when appropriate
make use of constraint names to generate somewhat user-friendly error messages in case a constraint is violated
All this should happen dynamically, so table adjustments are automatically reflected on the frontend - probably with a caching mechanism, so that all the model introspection wouldn't kill performance. In other words, I don't want to repeat my model definition in an XML file (or alike) when it has already been carefully defined in my database.
Does such a framework exist for Python (or for any language, for that matter)? If not: Which of the several Python web application frameworks will be least in the way if I were to add parts of the above features myself?
web2py does most of what you ask:
Based on a field type and its validators it will render the field with the appropriate widget. You can override with
db.table.field.widget=...
and use a third party widget.
web2py has JS to block the user from entering a non-integer in an integer field or a non-double in a double field. Time, date, and datetime fields have their own pickers. These JS validations work with (not instead of) server-side validation.
There is an IS_EMPTY_OR(...) validator.
The DAL prevents SQL injection, since everything is escaped as it goes into the DB.
web2py prevents XSS because in {{=variable}}, 'variable' is escaped unless specified otherwise, e.g. {{=XML(variable)}} or {{=XML(variable,sanitize=True)}}.
Error messages are arguments of validators, for example:
db.table.field.requires=IS_NOT_EMPTY(error_message=T('hey! write something in here'))
T is for internationalization.
You should have a look at Django, and especially its newforms and admin modules. The newforms module provides a nice way to do server-side validation with automated generation of error messages/pages for the user. Adding AJAX validation is also possible.
TurboGears currently uses SQLObject by default but you can use it with SQLAlchemy. They are saying that the next major release of TurboGears (1.1) will use SQLAlchemy by default.
I know you specifically asked for a framework, but I thought I would let you know what I get up to here. I have just finished converting my company's web application from a custom in-house ORM layer to SQLAlchemy, so I am far from an expert, but something that occurred to me was that SQLAlchemy has types for all of the attributes it maps from the database, so why not use that to help output the right HTML onto the page? We use SQLAlchemy for the back end and Cheetah templates for the front end, but everything in between is basically our own.
We have never managed to find a framework that does exactly what we want without compromise, and we prefer to get all the bits that work right for us and write the glue ourselves.
Step 1. For each data type (sqlalchemy.types.INTEGER etc.), add an extra method toHtml (or several, maybe toHTMLReadOnly, toHTMLAdminEdit, whatever) and have it return the HTML for that field. Now you don't even have to care what data type you're displaying; if you just want to spit out a whole table, you can just do the following (as a Cheetah template, or whatever your templating engine is).
Step 2
<table>
<tr>
#for $field in $dbObject.c:
<th>$field.name</th>
#end for
</tr>
<tr>
#for $field in $dbObject.c:
<td>$field.type.toHtml($field.name, $field.value)</td>
#end for
</tr>
</table>
Using this basic method and stretching Python's introspection to its potential, in an afternoon I managed to make create/read/update/delete code for the whole admin section of our database. It doesn't yet have the polish of Django, but it's more than good enough for my needs.
Step 3. I discovered the need for a third step just on Friday: I wanted to upload files, which, as you know, needs more than the VARCHAR data type's default text box. No sweat, I just overrode the column type in my table definition from VARCHAR to FilePath(VARCHAR), where the only difference was that FilePath had a different toHtml method. Worked flawlessly.
All that said, if there is a shrink-wrapped one out there that does just what you want, use that.
Disclaimer: This code was written from memory after midnight and probably won't produce a functioning web page.
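For reference, a rough sketch of the toHtml idea using SQLAlchemy's TypeDecorator; the class and method names are guesses modeled on the description above, not the author's actual code:

from sqlalchemy import types

class HtmlVarchar(types.TypeDecorator):
    """VARCHAR that knows how to render itself as a form field."""
    impl = types.VARCHAR

    def toHtml(self, name, value):
        return '<input type="text" name="%s" value="%s">' % (name, value or "")

class FilePath(HtmlVarchar):
    """Stored exactly like VARCHAR; only the widget differs."""
    def toHtml(self, name, value):
        return '<input type="file" name="%s">' % name

# Usage in a table definition would be e.g. Column("avatar", FilePath(255)),
# after which the template's $field.type.toHtml(...) picks the right widget.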
I believe that Django's models do not support composite primary keys (see the documentation). But perhaps you can use SQLAlchemy in Django? A Google search indicates that you can. I have not used Django, so I don't know.
I suggest you take a look at:
ToscaWidgets
DBSprockets, including DBMechanic
Catwalk. Catwalk is an application for TurboGears 1.0 that uses SQLObject, not SQLAlchemy. Also check out this blog post and screencast.
FastData. Also uses SQLObject.
formalchemy
Rum
I do not have any deep knowledge of any of the projects above. I am just in the process of trying to add something similar to what the original question mentions to one of my own applications. The above list is simply a list of interesting projects that I have stumbled across.
As to web application frameworks for Python, I recommend TurboGears 2. Not that I have any experience with any of the other frameworks, I just like TurboGears...
If the original question's author finds a solution that works well, please update or answer this thread.
