Recommendation for click/event tracking mechanisms (Python, Django, Celery, Mongo, etc.)

I'm looking into ways to track events in a Django application (events would generally be clicks tied to a specific unique user id).
These events would essentially contain an event type like "click", and each click event would be assigned to a unique id (many events can map to one id); each event would also carry a data set including items like the referrer.
I have tried Mixpanel, but for now the data API they offer seems too limiting, as I can't find a way to get all of my data out by a unique id (apart from the event itself).
I'm looking into using django-eventracker, but I'm curious about others' thoughts on the best way to do this. Mongo or CouchDB seem like a great choice here, and Celery/RabbitMQ combined with Mongo looks really attractive. Pumping these events into the existing application's DB seems limiting at this point.
Anyway, this is just a thread to see what others think about this and how they have implemented something like it...
shoot

I am not familiar with the pre-packaged solutions you mention. Were I to design this from scratch, I'd have some simple JS collect info on clicks and POST it back to the server via Ajax (using whatever JS framework you're already using), and on the server side I'd simply append that info to a log file for later "offline" processing -- that keeps it essentially independent of Django or any other server-side framework.
Appending to a log file is a very lightweight operation, while DBs for web use are generally optimized for read-intensive (not write-intensive) workloads, so I agree with you that force-fitting that info (as it trickles in) into the existing app's DB is unlikely to perform well.
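For illustration, a minimal sketch of such a collection endpoint in Django might look like the following; the URL wiring, field names and log path are assumptions for the example, not a prescription:

    # views.py -- append each click event as one JSON line (cheap to write,
    # easy to parse offline); LOG_PATH is a hypothetical location
    import json
    import time
    from django.http import HttpResponse

    LOG_PATH = '/var/log/myapp/clicks.log'

    def track_click(request):
        event = {
            'ts': time.time(),
            'user_id': request.POST.get('user_id'),
            'event_type': request.POST.get('event_type', 'click'),
            'referrer': request.META.get('HTTP_REFERER', ''),
        }
        with open(LOG_PATH, 'a') as f:
            f.write(json.dumps(event) + '\n')
        return HttpResponse('ok')

The JS side just POSTs user_id and event_type to this view on each click; rotating and batch-processing the log file stays completely outside the request path.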

You probably want to keep a flexible format for your logs to anticipate future needs or changes. In this sense, the schema-less document-oriented databases are nice. One advantage is that the structure of your data will be close to your application needs for whatever analyses you perform later (so, avoiding some of the inevitable parsing/data munging work).
If you're thinking about using MySQL, PostgreSQL or the like, then you should look into something like rsyslog for buffering writes and avoiding the performance penalty of heavy logging. (I can't say much about Celery and other queueing mechanisms for this type of thing, but they sound promising.)
MongoDB has some nice features that make it amenable to logging, such as capped collections. A summary can be found in this post.
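As a rough sketch of what a capped collection looks like with the old-style pymongo Connection API (the collection name and size here are placeholders):

    # capped collections preallocate space and silently age out the oldest
    # documents, which keeps inserts very fast -- handy for logging
    from pymongo import Connection

    db = Connection()['tracking']
    db.create_collection('events', capped=True, size=100 * 1024 * 1024)
    db.events.insert({'user_id': 42, 'event_type': 'click',
                      'referrer': 'http://example.com/'})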

If by click, you mean a click on a link that loads a new page (or performs an AJAX request), then what you aim to do is fairly straightforward. Web servers tend to keep plain-text logs about requests - with information about the user, time/date, referrer, the page requested, etc. You could examine these logs and mine the statistics you need.
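For example, a quick way to mine a combined-format access log in Python might look like this (the log path and regex are assumptions; adjust them to your server's actual format):

    # pull IP, request line and referrer out of each combined-format log line
    import re

    LINE = (r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] "(?P<request>[^"]*)" '
            r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')
    pattern = re.compile(LINE)

    with open('/var/log/nginx/access.log') as logfile:
        for line in logfile:
            match = pattern.match(line)
            if match:
                print match.group('ip'), match.group('request'), match.group('referrer')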
On the other hand, if you have a web application where clicks don't necessarily generate server requests, then collecting click information with javascript is your best bet.

Related

Handling Notifications to Users

I am building an application on GAE that needs to notify users when another user performs an action that affects them. A real-world analogy would be being alerted when a friend comments on your Facebook status.
I understand how the Channel API works to actually send notifications in real time, but I'm trying to understand the most effective way to store those notifications in the datastore. Ideally, I want the notification code to be decoupled from the actual event being performed. Is this a good use case for Prospective Search? It doesn't quite feel right, since I don't need to perform any kind of searching, just: if you see a new comment, create a new notification that is stored in the datastore and pushed to the client through the Channel API if they are connected. I basically need a database trigger, but I don't think GAE supports that.
Why don't you want to couple the event and its notifications in the first place?
I think it may be interesting to know in order to help you with your use case :)
If I had to do this I would enqueue a task any time I write something to the datastore that might fire events...
That way you can do your write and have a separate "layer" process the events (a sketch follows).
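A sketch of what that might look like with the App Engine task queue API; the model, handler URL and parameter names are invented for illustration:

    # enqueue notification fan-out instead of doing it inline with the write,
    # so the event layer stays decoupled from the datastore write itself
    from google.appengine.api import taskqueue
    from google.appengine.ext import db

    class Comment(db.Model):
        author = db.StringProperty()
        status_key = db.StringProperty()
        text = db.TextProperty()

    def post_comment(author, status_key, text):
        comment = Comment(author=author, status_key=status_key, text=text)
        comment.put()
        # a handler mapped to /tasks/notify decides who to notify, writes the
        # notification entities, and pushes over the Channel API if connected
        taskqueue.add(url='/tasks/notify',
                      params={'comment_key': str(comment.key())})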
Triggers would not even be that good an option, since your application would still have to poll the database in order to push events to the users' UI.
I think your process (firing events) does not belong in the database, since it may well need business rules that the datastore cannot provide: for example, when a user ignores another one, you should not fire events.
The more business logic you put in your database system, the more complex it gets to maintain and scale, IMHO...
Looks like GAE does support mimicking database triggers using hooks (see the sketch after this list).
Hooks can be useful for:
- query caching
- auditing Datastore activity per-user
- mimicking database triggers
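As a very rough sketch, assuming the low-level apiproxy hook mechanism (an unofficial corner of the SDK, so treat the names and behavior here as assumptions to verify):

    # register a post-call hook that fires after every datastore RPC and
    # filters for Put calls -- the closest thing to an AFTER INSERT trigger
    import logging
    from google.appengine.api import apiproxy_stub_map

    def put_trigger(service, call, request, response):
        if call == 'Put':
            logging.info('datastore Put observed -- trigger logic goes here')

    apiproxy_stub_map.apiproxy.GetPostCallHooks().Append(
        'put_trigger', put_trigger, 'datastore_v3')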

Server Topology Help - Django and Twisted Possibility?

I am currently working on a complex web interface and backend that will need to address several issues:
Scalability
- multiple deployments of varying load demands
Very structured authorization groups
- different views for different user groups
- admin panel
  - user/content management
Large managed database
- current data
- long-term stored data (histories)
Data updates
- polling, e.g. search queries, static pages/files, report generation per request
- pushing (likely WebSockets), e.g. real-time notifications
Varying protocols
- e.g. HTTP, SSL, WebSockets
I would like to use Python, because I have grown to really enjoy the language, and I am considering some combo of Django and Twisted.
I have some experience with Django, which I love for its MVT style of application programming, its authorization models, its admin panel, and its database API. However, it is not so strong in some of the data requirements that I need, in particular, the real-time aspects.
Now, I have not really used Twisted before, but I have seen many interesting things about it, in particular the async aspects and the ability to serve many protocols.
The problem in getting the two to work together is obvious: Django is a blocking server and Twisted is designed to be non-blocking. I have seen some topics stating that using the two together is possible and that people have had success with it. It also seems possible to run both and proxy them to handle different URLs, but sharing authentication across the two may become tricky?
Having said all of that, I would like to ask whether I am on the right track for implementing this system, as well as for suggestions on how to use the two together, alternatives, or whether I should just drop one (at this point, I guess it'd have to be Django, because the real-time stuff is necessary). I should mention that I have already written some of the preliminary data models and views in Django.
I am quite experienced on the client side of things (JS, CSS, HTML), but I am not so savvy on the server side. Any input would be helpful, thanks.
You can definitely use Twisted with Django. Several projects have used the two together to good effect. twistd web --wsgi provides a basic way to get set up, and there's a great example with more bells and whistles, like static content serving, by Alex Clemesha on GitHub.
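For reference, a minimal .tac file wiring Django's WSGI handler into Twisted might look roughly like this (mysite.settings and the port are placeholders; run it with twistd -ny django.tac):

    # serve the Django WSGI application from Twisted's thread pool, leaving
    # the reactor free to run non-blocking protocols alongside it
    import os
    os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

    from twisted.application import service, strports
    from twisted.internet import reactor
    from twisted.web import server, wsgi
    from django.core.handlers.wsgi import WSGIHandler

    application = service.Application('django-under-twisted')
    resource = wsgi.WSGIResource(reactor, reactor.getThreadPool(), WSGIHandler())
    strports.service('tcp:8000', server.Site(resource)).setServiceParent(application)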

Django/Python - what's the recommended secure way to exchange data between my infrastructure and my customers?

I'm using Django/Postgres and Python for my web site and the background processes. I have hundreds of messages populating my database every minute, and I would like to securely allow customers to access their data.
My customers use either linux or windows so I would like some solution that will be platform/database agnostic.
So far I have looked at Piston, Twisted, Celery and RabbitMQ. All of these offer some way to exchange data, but I'm not sure which to use or whether there are better options.
For example, I need customers to be able to access only their own data in my database. I also need to allow customers to send a short command back to my servers; my servers will execute the command and return the result to the customer in real time.
Any ideas?
You asked how your customers can securely transmit commands to your website and retrieve results in their response (near "real-time").
... have you considered hooking a reasonable API into your Django app? If you're concerned about security, you can use authentication and serve it over HTTPS.
It's not as fancy as the messaging and queuing platforms that the kids are using these days but it'll get the job done.
Things to like about HTTP/HTTPS APIs:
They can be load balanced (highly available and scalable!)
They can be cached (mo' betta performance and the ability to still serve content while rate limiting how often a client can hit the DB)
Just about every programming language has a mature library that allows HTTP/HTTPS connections. Some have several, e.g. Python: urllib, urllib2, httplib
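To illustrate, a bare-bones Python 2 client for such an API might look like this (the endpoint, command and credentials are placeholders):

    # send a short command over HTTPS with basic auth and read the result
    import base64
    import urllib2

    req = urllib2.Request('https://api.example.com/commands', data='cmd=status')
    req.add_header('Authorization',
                   'Basic ' + base64.b64encode('customer:secret'))
    print urllib2.urlopen(req).read()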

Quick test if a web page offers asynchronous HTTP?

I am fetching current data from another company's web feed. It is a simple fetch of an XML file over HTTP. They haven't provided me with much documentation - just a URL.
Because I need to know as soon as possible when the data changes on their site, I need to poll frequently, which isn't a satisfactory solution for either side.
I was about to recommend that they set up some sort of server push - presumably a long-lived HTTP connection with asynchronous updates sent by the server. I am not very familiar with the common protocols for this, and it occurred to me that they may already offer it and I have been too ignorant to realise.
Is there a common web-based protocol for server push over HTTP? If there is, is there a quick way to check whether they support it, before I make myself look foolish by asking for something that is already available?
(Bonus points for a platform-independent, Python-based solution, but I will take what I can get.)
What you want is HTTP Streaming; read this page. "Comet" is what this technology is commonly called. One implementation is the Ajax Push Engine (APE); the page I just gave you has several others.
Now I don't think it's possible to automatically test whether a server supports a push technology, because as of now there are no standards for this and the protocols used vary with the implementation.
Alternatively you can use periodic refresh ("polling"). The advantages of this technique are that you don't need additional software on the server, and that it can be done without the cooperation of the server you are polling (Comet is infeasible if the server you are querying won't install it).
For more information and tricks to reduce bandwidth usage on polling, see this page. Some of these will require some effort from the server you are polling.
I suggest you read this Wikipedia article on the subject. What you want is certainly possible; however, it may not be supported by all browsers.
That said... I generally recommend against push technologies on the web, as they sap the resources of a server much faster than a request/response paradigm.
Perhaps there's another way? Polling frequently to see if the file has changed is at least a small payload... why is it unsatisfactory for both sides?
Unless you can get the other company to change some of its practices -- perhaps to FTP you the new file, or call a webservice to let your company know that the file has changed -- you may be stuck with polling.
I'm not aware of any method to test whether a web server supports a push technology.
You should ask that company whether a Comet approach could be adopted to avoid polling.
For a Python-based Comet solution, have a look here.
To avoid unnecessary downloads I would check the ETag and Last-Modified headers, as described here:
http://diveintopython3.ep.io/http-web-services.html
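A sketch of such a conditional GET with urllib2 (the feed URL is a placeholder; note that urllib2 surfaces an unchanged resource as an HTTPError with code 304):

    import urllib2

    url = 'http://example.com/feed.xml'

    # first fetch: remember the validators the server sent back
    first = urllib2.urlopen(url)
    etag = first.headers.get('ETag')
    last_modified = first.headers.get('Last-Modified')
    data = first.read()

    # later poll: only re-download if the resource actually changed
    req = urllib2.Request(url)
    if etag:
        req.add_header('If-None-Match', etag)
    if last_modified:
        req.add_header('If-Modified-Since', last_modified)
    try:
        data = urllib2.urlopen(req).read()
    except urllib2.HTTPError, e:
        if e.code != 304:
            raise  # 304 just means "not modified" -- keep the cached copy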

Data Synchronization framework / algorithm for server<->device?

I'm looking to implement data synchronization between servers and distributed clients. The data source on the server is MySQL with Django on top. The client can vary. Updates can take place on either the client or the server, and the connection between server and client is not reliable (e.g. changes can be made on a disconnected cell phone, and should get synced when the phone has a connection again).
S. Lott suggests using a version control design pattern in this question, which makes sense. I'm wondering if there are any existing packages / implementations of this I can use. Or, should I directly make use of svn/git/etc?
Are there other alternatives? There must be synchronization frameworks or detailed descriptions of algorithms out there, but I'm not having a lot of luck finding them. I'd appreciate if you point me in the right direction.
Perhaps using plain old rsync is enough.
AFAIK there isn't any generic solution to this, mainly due to the diverse requirements for synchronization.
In one of our earlier projects we implemented a Spring Batch-based sync mechanism which relied on a last-updated timestamp field on each of the tables that take part in the sync.
I have heard about SyncML but don't have much experience with it.
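A simplified sketch of that timestamp approach against a hypothetical Django model (the model and field names are invented; the conflict handling is naive last-writer-wins):

    # each syncable table carries a last_updated column; clients pull rows
    # newer than their previous sync and push local edits back
    from myapp.models import Record  # assumed to have a last_updated field

    def pull_changes(last_sync_time):
        """Rows modified on the server since the client's previous sync."""
        return Record.objects.filter(last_updated__gt=last_sync_time)

    def apply_client_change(record, incoming):
        # last-writer-wins: real systems need smarter conflict detection
        if incoming['last_updated'] > record.last_updated:
            for field, value in incoming['fields'].items():
                setattr(record, field, value)
            record.save()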
If you have a single server and multiple clients, you could consider a JMS-based approach:
the data is bundled and placed in queues (or topics) and pulled by the clients.
In your case, since updates are bi-directional, you need to handle conflict detection as well, which brings additional complexity.
