How do I pass around data between pages in a GAE app? - python

I have a fairly basic GAE app that takes some input, fetches some data from a webpage, parses, then presents it to the user. Right now, the fairly spare input HTML form POSTs the arguments to the output 'file' which is wholly generated by the handler for that URL.
I'd like to do a couple things with the data (e.g. graph it at a landing page perhaps, then write it to an output file), but I don't know how I should pass the parsed data between the different handlers. I could maybe encode it then successively POST it to other handlers, but my gut says that I shouldn't need to HTTP the data back and forth within my app—it seems terribly inefficient (my gut is also hungry...).
In fairly broad swaths (or maybe a link to an example), how should my handlers handle this?
Later thoughts (edit)
My very rough idea now is to have the form submitted to a page that 1) enters the subsequent query into a database (datastore?) keyed to some hash, then uses that to 2) grab and parse all the data. The parsed data would be stored in memory (memcache?) for near-immediate use to graph it and/or process it into a variety of tabular formats for download. The script that does said parsing redirects to a unique URL based on the hash which can be referred to in order to get the data.
The thought would be that you could save the URL, then if you visit it later when the data has been lost, it can re-query the source to get it back/update it.
Reasonable? Should I be looking at other things?
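The flow described in the edit above can be sketched roughly as follows. This is only a minimal illustration, not a GAE implementation: plain dicts stand in for the datastore and memcache so it runs anywhere, and `fetch_and_parse`, `submit`, `get_results` and the `/results/` URL prefix are hypothetical names invented for the example.

```python
import hashlib

# Stand-ins for the datastore and memcache (assumption: a real GAE app would
# use the platform services instead of module-level dicts).
QUERY_STORE = {}   # hash -> original query (durable)
PARSED_CACHE = {}  # hash -> parsed data (may be evicted at any time)


def fetch_and_parse(query):
    """Hypothetical placeholder for fetching the source page and parsing it."""
    return {"query": query, "rows": [len(word) for word in query.split()]}


def submit(query):
    """Form handler: record the query, parse once, return the URL to redirect to."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    QUERY_STORE[key] = query
    PARSED_CACHE[key] = fetch_and_parse(query)
    return "/results/" + key


def get_results(key):
    """Results handler: use the cache if possible, otherwise re-query the source."""
    data = PARSED_CACHE.get(key)
    if data is None:
        query = QUERY_STORE.get(key)
        if query is None:
            return None  # unknown key; a real handler would return a 404
        data = fetch_and_parse(query)
        PARSED_CACHE[key] = data
    return data
```

The graphing and download handlers would both call `get_results` with the key from the URL, so the parsed data never has to be POSTed between pages, and a saved URL still works after the cached copy is gone.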

Related

Design Question: Best 'place' to parse Scrapy text in Items?

This question is specifically about where, from an architecture/design perspective, the best place is to parse text obtained from a response object in Scrapy.
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site.
I've gotten all the data points I need, and have them stored in a local database (SQLite).
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and thus is stored in my database? Or do I just collect the scraped responses 'raw', store directly in my DB and do data cleaning in SQL after the fact, or even via a separate python script?
As mentioned, new to programming as a whole so I'm not sure what's best from a 'design' perspective
If you are doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data. So figuring out what those fields are before you write your scraper is the first step.
I don't necessarily think you need your own middleware unless your data particularly needs work done at the level of requests and responses. Middlewares are mostly useful for processing requests and responses rather than for data manipulation/cleaning, e.g. if you have duplicates, or need to change responses or add requests.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called ItemAdapter, which is essentially a wrapper for the different ways of storing item data.
There are also ways to clean data in small ways and in larger ways within Scrapy. You can use ItemLoaders, which put your items through small functions that can manipulate data, or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
You will have to think about database design and what tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy. The database pipeline is flexible enough for you to place data into any table you want using SQL queries.
Familiarising yourself with the Scrapy architecture overview in the documentation might help you create a mental model of the process.
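As a rough illustration of the pipeline option, here is a minimal sketch of an item pipeline that tags each play with a category before it reaches the database pipeline. A Scrapy item pipeline is a plain Python class with a `process_item(self, item, spider)` method, so the sketch does not import Scrapy at all. The field names `description` and `category`, the class name, and the keyword patterns are assumptions made up for the example, and items are assumed to be plain dicts.

```python
import re

# Keyword patterns are illustrative guesses; tune them against real play text.
CATEGORY_PATTERNS = [
    ("Passing Play", re.compile(r"\b(pass|passes|threw)\b", re.IGNORECASE)),
    ("Rushing Play", re.compile(r"\b(ran|rush|rushed)\b", re.IGNORECASE)),
]


def categorize(description):
    """Return the first category whose pattern matches, else 'Other'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(description):
            return category
    return "Other"


class PlayCategoryPipeline:
    """Item pipeline: Scrapy calls process_item once for every scraped item."""

    def process_item(self, item, spider):
        # Assumes dict items with a hypothetical "description" field.
        item["category"] = categorize(item.get("description", ""))
        return item
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting with a lower order number than the database pipeline, so the category is already on the item when it is stored. Keeping `categorize` as a standalone function also means the same code can be rerun later over raw rows already in the database.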

GTFS route planner

I am in the process of creating an application which can tell people when a bus leaves from a certain stop and I would like to add route planning to it.
I need a way to plan routes from one stop to another in a couple of seconds. I'm getting my data from a GTFS file parsed into SQLite.
I have looked at OpenTripPlanner and GraphServer, but I couldn't find an API which can plan routes and give those routes back in a JSON or some other format.
You may have overlooked that, according to the OpenTripPlanner documentation, the API can return either a JSON or an XML response.
Have a look at this specific section: http://dev.opentripplanner.org/apidoc/0.20.0/json_Response.html
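Separately from OpenTripPlanner, and only as a hedged sketch: if the GTFS data is already in SQLite, direct (no-transfer) connections between two stops can be looked up with a self-join on `stop_times`. The column names follow the GTFS `stop_times.txt` fields; the function names and the demo rows are invented for the example, and the query ignores service calendars and transfers, so it is not a route planner.

```python
import sqlite3


def direct_trips(conn, from_stop, to_stop, after_time):
    """Return (trip_id, departure, arrival) for trips serving both stops in order."""
    sql = """
        SELECT a.trip_id, a.departure_time, b.arrival_time
        FROM stop_times AS a
        JOIN stop_times AS b ON a.trip_id = b.trip_id
        WHERE a.stop_id = ? AND b.stop_id = ?
          AND a.stop_sequence < b.stop_sequence
          AND a.departure_time >= ?
        ORDER BY a.departure_time
    """
    # Zero-padded HH:MM:SS strings compare correctly as text.
    return conn.execute(sql, (from_stop, to_stop, after_time)).fetchall()


def demo_connection():
    """Build a tiny in-memory stop_times table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stop_times (trip_id TEXT, arrival_time TEXT, "
        "departure_time TEXT, stop_id TEXT, stop_sequence INTEGER)"
    )
    conn.executemany(
        "INSERT INTO stop_times VALUES (?, ?, ?, ?, ?)",
        [
            ("T1", "08:00:00", "08:00:00", "A", 1),
            ("T1", "08:10:00", "08:10:00", "B", 2),
            ("T2", "09:00:00", "09:00:00", "B", 1),
            ("T2", "09:10:00", "09:10:00", "A", 2),
            ("T3", "09:30:00", "09:30:00", "A", 1),
            ("T3", "09:45:00", "09:45:00", "B", 2),
        ],
    )
    return conn
```

For anything involving transfers, walking legs, or calendars, a real planner such as OpenTripPlanner is still the more sensible route.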

Get a JSON tree of all comments of a post?

I'm looking to back up a subreddit to disk. So far, it doesn't seem to be easily possible with the way that the Reddit API works. My best bet at getting a single JSON tree with all comments (and nested comments) would seem to be storing them inside of a database and doing a pretty ridiculous recursive query to generate the JSON.
Is there a Reddit API method which will give me a tree containing all comments on a given post in the expected order?
The number of comments you get from the API has a hard limit, for performance reasons; to ensure you're getting all comments, you have to parse through the child nodes and make additional calls as necessary.
Be aware that the subreddit listing will only include the latest 1000 posts, so if your target subreddit has more than that, you probably won't be able to obtain a full backup anyways.
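Parsing through the child nodes could look something like the sketch below. It assumes the usual shape of an already-decoded comment listing: comment nodes have `kind` `"t1"` with nested replies under `data.replies` (an empty string when there are none), and placeholder nodes have `kind` `"more"` with a list of comment IDs that still need fetching. The function name and the simplified output shape are made up for the example, and the additional API calls themselves are left out.

```python
def walk_comments(children):
    """Split listing children into a nested comment tree and pending comment IDs."""
    tree, pending = [], []
    for child in children:
        kind, data = child["kind"], child["data"]
        if kind == "more":
            # Placeholder node: these IDs need additional API calls.
            pending.extend(data.get("children", []))
        elif kind == "t1":
            replies = data.get("replies") or {}  # "" when there are no replies
            sub_children = []
            if isinstance(replies, dict):
                sub_children = replies.get("data", {}).get("children", [])
            sub_tree, sub_pending = walk_comments(sub_children)
            tree.append(
                {"id": data["id"], "body": data.get("body", ""), "replies": sub_tree}
            )
            pending.extend(sub_pending)
    return tree, pending
```

The returned tree keeps the order the API delivered, and the pending list tells you which comments to request next before merging them back into the tree.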

Sending objects from Jinja Templates to Python

I started using Jinja Templating with Python to develop web apps. With Jinja, I am able to send objects from my Python code to my index.html, but is it possible to receive objects from my index.html to my Python code? For example, passing a list back and forth. If so, do you have any examples?
Thank You!
Why do this? Any logic that you implement in the template is accessible to you in the controller of your app, including any variables that you place in the template context.
If the data has been changed due to interaction with the user, then the best way to retrieve data, in my opinion, is to set up a form and use the normal POST method to send the request and the required data, correctly encoded and escaped, back to your program. In this way, you are protected from XSS issues, among other inconveniences. I would never do any processing in a template, and only use any local logic to modify the presentation itself.
EDIT Taking into account your scenario, I suggest the following:
The user presses a button on a page and invokes a GET handler.
The GET handler queries a database and receives a list of images. The list is cached, maybe in memcache, and the cache key is encoded as a parameter in the GET URL displayed by the template.
The list of images gets passed to the template engine for display.
Another button is pressed and a different GET handler is invoked. It uses the key encoded in the GET URL, after sanitising and validating it, to retrieve the cached list.
If you don't want the intermediate step of caching a key-value pair, you may want to encode the whole list in the GET URL, and the step of sanitising and validation should be as easy on the whole list as on a key to the list. Both methods avoid a round trip to the database, protect you from malicious use, and respect the separation of data, presentation, and logic.
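The key-in-the-URL variant of the steps above can be sketched as follows. This is a minimal, framework-free illustration: a dict stands in for memcache, the handler names, the `/next` path and the `key` parameter are hypothetical, and the returned dict plays the role of the context you would hand to the Jinja template.

```python
import re
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

CACHE = {}  # stand-in for memcache (assumption for this sketch)

# Only accept keys made of URL-safe characters, as produced below.
KEY_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def first_handler(images):
    """Cache the list and build the template context, including the next URL."""
    key = secrets.token_urlsafe(16)
    CACHE[key] = list(images)
    next_url = "/next?" + urlencode({"key": key})
    return {"images": images, "next_url": next_url}


def second_handler(url):
    """Validate the key from the GET URL and return the cached list, if any."""
    params = parse_qs(urlparse(url).query)
    key = params.get("key", [""])[0]
    if not KEY_PATTERN.match(key):
        return None  # reject malformed or missing keys
    return CACHE.get(key)
```

Only the opaque key travels through the template and back, so the list itself never has to be reconstructed from anything the browser sends.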
Just a thought.. Have you tried accessing the variables in the dict you passed to jinja after processing the template?

Efficient storage of and access to web pages with Python

So like many people I want a way to download, index/extract information and store web pages efficiently. My first thought is to use MySQL and simply shove the pages in which would let me use FULLTEXT searches which would let me do ad hoc queries easily (in case I want to see if something exists and extract it/etc.). But of course performance wise I have some concerns especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB/search engines/etc. So to summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store meta data (URL, time retrieved, any GET/POST stuff I sent), response code, etc. of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible, I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss meta data into MySQL, store the data on the file system or something like CouchDB, write some custom search stuff.
Toss meta data into MySQL, store the data on a file system with a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html/etc specified (directory index each directory in other words) and use some search engine like Lucene or Sphinx to index the content and use that to search. Biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering please include links to any technologies you mention and if possible what programming languages it has libraries for (i.e. if it's Scala only or whatever it's probably not that useful since this is a Python project). If this question has already been asked (I'm sure it must have been) please let me know (I searched, no luck).
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine that MySQL's wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming &c). Your metadata all comes from headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- html will come to you in all sorts of weird encodings, but there's no need to get crazy supporting that at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
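A minimal sketch of that transcode-on-arrival step, using only the standard library. The function name and the fallback order (declared encoding, then UTF-8, then Latin-1) are assumptions chosen for the example; real pages may also need charset sniffing from the HTML itself, which is not shown here.

```python
def to_utf8(raw, declared_encoding=None):
    """Decode raw page bytes and return them re-encoded as UTF-8."""
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit it: try the next one.
            continue
    # Last resort: Latin-1 maps every byte to a character, so this cannot
    # fail, although the text will be wrong if the guess is wrong.
    return raw.decode("latin-1").encode("utf-8")
```

Calling this once before storing and indexing means everything downstream (the database column, the search index, ad hoc queries) only ever sees UTF-8.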
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
You may be trying to achieve too much with the storage of the html (and supporting files). It seems you wish this repository would both
allow displaying a particular page as it appeared on its original site
provide indexing for locating pages relevant to particular search criteria
The html underlying a web page once looked a bit akin to a self-standing document, but the pages crawled off the net nowadays are much messier: javascript, ajax snippets, advertisement sections, image blocks etc.
This reality may cause you to rethink the one storage for all html approach. (And also the parsing / pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. (By "true text content", I mean [possibly partially marked-up] text from the web pages that is otherwise free of all other "Web 2.0 noise".) Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres, in the form of semi-structured data. For operational purposes (e.g. to task the crawlers), you may keep a relational store with management-related metadata, but the idea is that for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
It sounds to me like you need a content management system. Check out Plone. If that's not what you want maybe a web framework, like Grok, BFG, Django, Turbogears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)
