Efficient storage of and access to web pages with Python

So like many people I want a way to download, index/extract information from, and store web pages efficiently. My first thought is to use MySQL and simply shove the pages in, which would let me use FULLTEXT searches and do ad hoc queries easily (in case I want to see if something exists and extract it, etc.). But of course performance-wise I have some concerns, especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB, search engines, etc. To summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store meta data (URL, time retrieved, any GET/POST stuff I sent, response code, etc.) of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible; I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss meta data into MySQL, store the data on the file system or something like CouchDB, write some custom search stuff.
Toss meta data into MySQL, store the data on a file system with a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html/etc. specified (in other words, let each directory show a directory index), and use some search engine like Lucene or Sphinx to index the content and use that to search. Biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering please include links to any technologies you mention and, if possible, what programming languages it has libraries for (e.g. if it's Scala-only or whatever, it's probably not that useful since this is a Python project). If this question has already been asked (I'm sure it must have been) please let me know (I searched, no luck).

Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine MySQL wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming &c). Your metadata all comes from headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- html will come to you in all sorts of weird encodings, but there's no need to get crazy supporting that at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
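For illustration, a minimal sketch of that normalization step, assuming the requests library (the helper name is just a placeholder):

    import requests

    def fetch_as_utf8(url):
        # Hypothetical helper: fetch a page and hand back its body as a
        # Unicode string, whatever encoding the server used.
        resp = requests.get(url, timeout=30)
        # resp.encoding comes from the HTTP headers; apparent_encoding is a
        # charset-detection fallback for pages that omit or lie about it.
        encoding = resp.encoding or resp.apparent_encoding or "utf-8"
        try:
            text = resp.content.decode(encoding, errors="replace")
        except LookupError:  # unknown/bogus charset name
            text = resp.content.decode("utf-8", errors="replace")
        return resp.status_code, text  # store/index `text` as UTF-8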
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
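A bare-bones query through the legacy sphinxapi module that ships with Sphinx might look roughly like this (the index name pages_index is a placeholder, and 9312 is searchd's default API port):

    from sphinxapi import SphinxClient, SPH_MATCH_EXTENDED2

    cl = SphinxClient()
    cl.SetServer("localhost", 9312)
    cl.SetMatchMode(SPH_MATCH_EXTENDED2)
    result = cl.Query("fizzbuzz", "pages_index")  # ad hoc full-text query
    if result:
        for match in result["matches"]:
            print(match["id"], match["weight"])
    else:
        print(cl.GetLastError())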

You may be trying to achieve too much with the storage of the html (and supporting files). It seems you wish this repository would both
allow displaying a particular page as it appeared on its original site
provide indexing for locating pages relevant to particular search criteria
The HTML underlying a web page once looked a bit like a self-standing document, but the pages crawled off the net nowadays are much messier: JavaScript, Ajax snippets, advertisement sections, image blocks, etc.
This reality may cause you to rethink the one-storage-for-all-HTML approach. (And also the parsing / pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. (By "true text content", I mean [possibly partially marked-up] text from the web pages that is otherwise free of all other "Web 2.0 noise".) Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres, in the form of semi-structured data. For operational purposes (e.g. to task the crawlers etc.), you may keep a relational store with management-related metadata, but the idea is that for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
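As a rough sketch of what that looks like in practice with pysolr (the core URL and field names are assumptions and must match your Solr schema):

    import datetime
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/pages", timeout=10)

    # One document mixing fielded metadata with the extracted free text.
    solr.add([{
        "id": "http://example.com/some/page",
        "url": "http://example.com/some/page",
        "retrieved_at": datetime.datetime.utcnow().isoformat() + "Z",
        "http_status": 200,
        "content_txt": "the cleaned-up text extracted from the page",
    }])

    # Fielded and free-text criteria coexist in one query:
    results = solr.search("content_txt:fizzbuzz AND http_status:200")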

It sounds to me like you need a content management system. Check out Plone. If that's not what you want maybe a web framework, like Grok, BFG, Django, Turbogears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)

Related

Design Question: Best 'place' to parse Scrapy text in Items?

This question is specifically about architecture/design: where is the best place to parse text obtained from a response object in Scrapy?
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site
I've gotten all the data points I need, and have them stored in a local database (sqlite)
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and is thus stored in my database? Or do I just collect the scraped responses 'raw', store them directly in my DB, and do data cleaning in SQL after the fact, or even via a separate Python script?
As mentioned, I'm new to programming as a whole, so I'm not sure what's best from a 'design' perspective.
If you're doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data, so figuring out what those fields are before you write your scraper is a first step.
I don't necessarily think you need your own middleware unless your data particularly needs work done at the request/response level. The middlewares are mostly useful for processing requests and responses rather than for data manipulation/cleaning, e.g. if you have duplicates, or need to change responses or add requests, etc.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called ItemAdapter, which is essentially a wrapper for different ways of storing data.
There are also ways to clean data, in small ways and in larger ways, within Scrapy. You can use ItemLoaders, which put your items through small functions that can manipulate data, or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
You will have to think about database design and what tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy; the database pipeline is flexible enough for you to place data into any table you want using SQL queries.
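For illustration, a rough sketch of a categorisation pipeline plus a database pipeline for your use case; the field names ('description', 'play_type'), the keyword lists and the table layout are all assumptions, and both classes still need to be enabled via ITEM_PIPELINES:

    import sqlite3
    from itemadapter import ItemAdapter

    class PlayCategoryPipeline:
        # Naive keyword buckets; tune these to the site's actual wording.
        KEYWORDS = {
            "Passing Play": ("pass", "sacked", "interception"),
            "Rushing Play": ("rush", "ran the ball", "scramble"),
        }

        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            text = (adapter.get("description") or "").lower()
            # Assumes 'play_type' is declared on your Item (or the item is a dict).
            adapter["play_type"] = next(
                (label for label, words in self.KEYWORDS.items()
                 if any(w in text for w in words)),
                "Other",
            )
            return item

    class SqlitePipeline:
        def open_spider(self, spider):
            self.conn = sqlite3.connect("plays.sqlite")
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS plays (description TEXT, play_type TEXT)"
            )

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            self.conn.execute(
                "INSERT INTO plays (description, play_type) VALUES (?, ?)",
                (adapter.get("description"), adapter.get("play_type")),
            )
            return item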
Familiarising yourself with the Scrapy architecture might help you create a mental model of the process; you can see it here.

Capturing information for customers such as referral URL and conversion

I was hoping to create my own in-house analytics so I can tell my customers how many visits their company page got on my site and which URL they came from. I am coding this in Python (Flask) and I wondered if anyone could tell me what the standard, or sensible, approach to this problem is.
I think it might be to have some sort of Redis queue which is triggered when a visitor arrives, with the information added to the database later so the site doesn't seem slow.
The standard, and sensible, approach is to use Google Analytics. If you must roll your own, you have two approaches. The first is JavaScript that is executed on every page (like GA) and pulls this kind of info into a DB. The second is parsing log files on the server; Awstats is a good bet for that.
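If you do roll your own, a minimal sketch of the Redis-queue idea you describe could look like this (the Redis key name and the batching worker are assumptions):

    import json
    import time

    import redis
    from flask import Flask, request

    app = Flask(__name__)
    r = redis.Redis()

    @app.after_request
    def log_visit(response):
        # Push a tiny visit record onto a Redis list; this keeps the request
        # fast and lets a background job write to the database in batches.
        r.rpush("visit_log", json.dumps({
            "path": request.path,
            "referrer": request.referrer,  # where the visitor came from
            "ts": time.time(),
        }))
        return response

    # A separate worker (cron job, Celery/RQ task, etc.) can pop items off
    # "visit_log" and insert them into the analytics tables later.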

Web scraping - Get keys from website

I've downloaded all the files from the following website (which represents bus routes of São Paulo, Brazil):
http://www.cruzalinhas.com/linha.json?key=12967
and was wondering if there is a more elegant way of doing it.
Basically, what I've done is loop over all the values between 00000 and 99999, substituting the number for ##### in
http://www.cruzalinhas.com/linha.json?key=#####
checking whether the page exists and, if it does, downloading it.
Is there a way of knowing all the keys beforehand in order to make this job more efficient?
I have all the files, but this is a very common problem and I was wondering if there is a shortcut to solve it.
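For reference, the brute-force loop described above boils down to something like this (the one-second delay is an arbitrary politeness choice; see the answer below about throttling):

    import time
    import requests

    BASE = "http://www.cruzalinhas.com/linha.json?key={:05d}"

    for key in range(100000):
        resp = requests.get(BASE.format(key), timeout=30)
        if resp.status_code == 200:  # the key exists; keep the JSON
            with open(f"linha_{key:05d}.json", "w", encoding="utf-8") as f:
                f.write(resp.text)
        time.sleep(1)  # be gentle with the server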
I'm the creator of Cruzalinhas, and apologize in advance for not having (yet) translated the site from Portuguese to English, which would otherwise make these matters clearer (and this answer shorter).
As you may have noticed, the data shows transportation routes for buses, subways and trains in São Paulo in a way that makes it easier to find their connections and plan routes with transfers (Google Maps and others do this automatically, sometimes missing routes that could be found with a more interactive search).
It uses geohashes for "cheap" proximity search between routes, and crawls its data from the São Paulo Transportation Company (SPTrans).
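As an aside, the geohash trick amounts to bucketing coordinates into cells that can be compared by prefix; a toy illustration (pygeohash is just one of several packages providing an encode() function, and precision 6 is an arbitrary cell size):

    from collections import defaultdict
    import pygeohash as pgh

    # Toy data: two routes, each a list of (lat, lon) stops.
    routes = {
        "route_a": [(-23.5505, -46.6333), (-23.5587, -46.6250)],
        "route_b": [(-23.5510, -46.6340), (-23.5975, -46.6410)],
    }

    cells = defaultdict(set)
    for route_id, stops in routes.items():
        for lat, lon in stops:
            cells[pgh.encode(lat, lon, precision=6)].add(route_id)

    # Routes that share a cell are cheap candidates for a connection.
    candidates = {cell: ids for cell, ids in cells.items() if len(ids) > 1}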
To answer your first question: those IDs are the ones from the original site. However, they are not really stable (I've seen them remove an old ID and replace it with a new one just because a line changed routes), so Cruzalinhas does a full crawl every now and then and updates the entire database (I'd replace it completely, but Google App Engine makes it a bit harder than usual).
The good news: the site is open-sourced (http://github.com/chesterbr/cruzalinhas) under an MIT license. Documentation is also still in Portuguese, but you will be mostly interested in sptscraper, the command-line crawler.
The most efficient way to get the data is to do a sptscraper.py download, then sptscraper.py parse, then sptscraper.py dump and import from there. There are more options, and here is a quick translation of its help screen:
Downloads and parses data from public transportation routes from the SPTrans website.

It parses HTML files and stores the result in the linhas.sqlite file, which can be used in other applications, converted to JSON or used to update cruzalinhas itself.

Commands:
  info           Shows the number of pending updates.
  download [id]  Downloads HTML files from SPTrans (all, or starting with id).
  resume         Continues the download from the last line saved.
  parse          Reads downloaded HTMLs and (soft) inserts/updates/deletes in the database.
  list           Outputs a JSON with the route IDs from the database.
  dump [id]      Outputs a JSON with all routes in the database (or just one).
  hashes         Prints a JSON with the geohashes for each line (mapping to the routes that cross the hash).
  upload         Uploads the pending changes in the database to cruzalinhas.
Keep in mind that this data is not taken with SPTrans' consent, even though it is public information and they are legally obliged to provide it. The site and the scraper were created as an act of protest against that, before the specific freedom-of-digital-information law passed (even though there was already previous legislation regulating the availability of public service information, so no illegal act was committed here, nor will be if you use it responsibly).
For that reason (and because the back end is a bit... "challenged"), the scraper is very careful about throttling requests in order to avoid overloading their servers. This makes the crawling span several hours, but you don't want to overload the service (which may force them into blocking you, or even changing the site to make crawling harder).
I'll eventually do a full rewrite of that code (it was likely my first Python/App Engine code, written a few years ago, as a quick hack focused on exposing how useful this public data can be outside the confines of SPTrans' website). It will have a saner crawling process, should make the latest data available for download in a single click, and will likely make a full list of lines available in the API.
For now, if you just want the last crawling (which I did a month or two ago), just contact me and I'll be happy to send you the sqlite/JSON files.

Extract statistical information from Wikipedia article

I'm currently extracting data from DBpedia articles using SPARQLWrapper for Python, but I can't seem to find how to extract the number of watchers (and other statistical information) for a given article.
Is there an easy way to achieve this? I don't mind if it's through DBpedia, or directly through wikipedia (using wget, for example).
Thanks for any advice.
Getting the number of watchers for an arbitrary article is deliberately prohibited, as it is considered a security leak if everyone could find unwatched pages. For example, only privileged users have access to Special:UnwatchedPages. There is a toolserver tool (which has access to the DB) showing the number of watchers, but for the same reason it is restricted to pages with more than 30 watchers, at least for unauthenticated users.
The MediaWiki query API mostly exposes content and status information about articles, though you can query and evaluate the public logs or revision histories as well to get statistical data about (public) user actions. For more stats about the Wikimedia sites you may have a look at Meta:Statistics, where various data sources (mostly http://stats.wikimedia.org/) and visualisations of them are listed.
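For example, pulling the public revision history of an article through the standard query API and deriving simple statistics from it looks roughly like this (continuation of long histories via rvcontinue is omitted):

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    resp = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "titles": "Python (programming language)",
        "rvprop": "timestamp|user",
        "rvlimit": 500,  # per-request maximum for ordinary clients
        "format": "json",
    }, headers={"User-Agent": "wiki-stats-example/0.1"})

    page = next(iter(resp.json()["query"]["pages"].values()))
    revisions = page.get("revisions", [])
    editors = {r["user"] for r in revisions if "user" in r}
    print(len(revisions), "revisions;", len(editors), "distinct editors")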

How to store wiki pages (VCS)

As a personal project I am trying to write a wiki with the help of Django. I'm a beginner when it comes to web development. I am at the (early) point where I need to decide how to store the wiki pages. I have three approaches in mind and would like to know your suggestions.
Flat files
I considered a flat file approach with a version control system like git or mercurial. Firstly, I would have some example wikis to look at like http://hatta.sheep.art.pl/. Secondly, the vcs would probably deal with editing conflicts and keeping the edit history, so I would not have to reinvent the wheel. And thirdly, I could probably easily clone the wiki repository, so I (or for that matter others) can have an offline copy of the wiki.
On the other hand, as far as I know, I cannot use Django models with flat files. Then, if I wanted to add fields to a wiki page, like a category, I would need to somehow keep a reference to that flat file in order to associate the fields in the database with it. Besides, I don't know if it is a good idea to have all the wiki pages in one repository; I imagine it is more natural to have something like a repository per wiki page or file. Last but not least, I'm not sure, but I think using flat files would limit my deployment options, because web hosts may not allow creating files (I'm thinking, for example, of Google App Engine).
Storing in a database
By storing the wiki pages in the database I can utilize Django models and associate arbitrary fields with each page. I probably would also have an easier life deploying the wiki. But I would not get VCS features like history and conflict resolution per se. I searched for Django extensions to help me and found django-reversion. However, I do not fully understand whether it fits my needs. Does it track model changes (for example, if I change the Django model file), or does it track the content of the models (which would fit my needs)? Plus, I do not see whether django-reversion would help me with edit conflicts.
Storing a VCS repository in a database field
This would be my ideal solution. It would combine the advantages of both previous approaches without the disadvantages. That is, I would have VCS features but save the wiki pages in a database. The problem is: I have no idea how feasible that is. I just imagine saving a wiki page/source together with a git/mercurial repository in a database field, yet I somehow doubt database fields work like that.
So, I'm open for any other approaches but this is what I came up with. Also, if you're interested, you can find the crappy early test I'm working on here http://github.com/eugenkiss/instantwiki-test
In none of your choices have you considered whether you wish to be able to search your wiki. If this is a consideration, having the 'live' copy of each page in a database with full text search would be hugely beneficial. For this reason, I would personally go with storing the pages in a database every time - otherwise you'll have to create your own index somewhere.
As far as version logging goes, you only need to store the live copy in an indexable format. You could automatically create a history item within your 'page' model when a changed page is written back to the database. You can cut down on the storage overhead of earlier page revisions by compressing the data, should this become necessary.
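A rough sketch of that "history item per save" idea with plain Django models (model and field names are placeholders, and edit-conflict handling is left out):

    import zlib
    from django.db import models

    class Page(models.Model):
        title = models.CharField(max_length=200, unique=True)
        content = models.TextField()  # the live, indexable copy
        updated_at = models.DateTimeField(auto_now=True)

        def save(self, *args, **kwargs):
            if self.pk:  # keep the outgoing version before overwriting it
                old = Page.objects.get(pk=self.pk)
                PageRevision.objects.create(
                    page=self,
                    compressed=zlib.compress(old.content.encode("utf-8")),
                )
            super().save(*args, **kwargs)

    class PageRevision(models.Model):
        page = models.ForeignKey(Page, related_name="revisions",
                                 on_delete=models.CASCADE)
        compressed = models.BinaryField()  # zlib-compressed older revision
        created_at = models.DateTimeField(auto_now_add=True)

        def text(self):
            return zlib.decompress(self.compressed).decode("utf-8")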
If you're expecting a massive amount of change logging, you might want to read this answer here:
How does one store history of edits effectively?
Creating a wiki is fun and rewarding, but there are a lot of prebuilt wiki software packages already. I suggest Wikipedia's List of wiki software. In particular, MoinMoin and Trac are good. Finally, John Sutherland has made a wiki using Django.
