Web scraping - Get keys from website - python

I've downloaded all the files from the following website (which represents bus routes of São Paulo, Brazil):
http://www.cruzalinhas.com/linha.json?key=12967
and was wondering if there is a more elegant way of doing it.
Basically, what I've done is loop over all the values between 00000 and 99999, substituting each number for ##### in
http://www.cruzalinhas.com/linha.json?key=#####
checking whether the page exists and, if it does, downloading it.
Is there a way of knowing all the keys beforehand in order to make this job more efficient?
I have all the files, but this is a very common problem and I was wondering if there is a shortcut to solve it.

I'm the creator of Cruzalinhas, and I apologize in advance for not having (yet) translated the site from Portuguese to English, which would make these matters clearer (and this answer shorter).
As you may have noticed, the data shows transportation routes for buses, subways and trains in São Paulo in a way that makes it easier to find their connections when planning trips (Google Maps and others do it automatically, sometimes missing routes that a more interactive search could find).
It uses geohashes for "cheap" proximity search between routes, and crawls its data from the São Paulo Transportation Company (SPTrans).
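For readers who haven't met geohashes, here is a minimal sketch of the standard encoding and of how a shared cell can act as a cheap proximity test. It is not the Cruzalinhas code: the function names, the default precision and the shape of route_points are assumptions made for illustration.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between the axes, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def routes_by_cell(route_points, precision=6):
    """Map each geohash cell to the set of route IDs with a point inside it."""
    cells = defaultdict(set)
    for route_id, points in route_points.items():
        for lat, lon in points:
            cells[geohash_encode(lat, lon, precision)].add(route_id)
    return cells
```

Routes that end up in the same cell are candidates for a connection. Two nearby points can still fall on opposite sides of a cell boundary, so a real implementation would probably also look at the neighbouring cells.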
To answer your first question: those IDs are the ones from the original site. However, they are not very stable (I've seen them remove an old ID and assign a new one just because a line changed routes), so Cruzalinhas does a full crawl every now and then and updates the entire database (I'd replace it completely, but Google App Engine makes that a bit harder than usual).
The good news: the site is open-sourced (http://github.com/chesterbr/cruzalinhas) under an MIT license. Documentation is also still in Portuguese, but you will be mostly interested in sptscraper, the command-line crawler.
The most efficient way to get the data is to do a sptscraper.py download, then sptscraper.py parse, then sptscraper.py dump and import from there. There are more options, and here is a quick translation of its help screen:
    Downloads and parses data from public transportation routes from the SPTrans
    website.
    It parses HTML files and stores the result in the linhas.sqlite file, which
    can be used in other applications, converted to JSON or used to update
    cruzalinhas itself.
    Commands:
      info           Shows the number of pending updates
      download [id]  Downloads HTML files from SPTrans (all or starting with id)
      resume         Continues the download from the last line saved.
      parse          Reads downloaded HTMLs and (soft) inserts/updates/deletes in
                     the database.
      list           Outputs a JSON with the route IDs from the database.
      dump [id]      Outputs a JSON with all routes in the database (or just one)
      hashes         Prints a JSON with the geohashes for each line (mapping to
                     the routes that cross the hash)
      upload         Uploads the pending changes in the database to cruzalinhas.
Keep in mind that this data is not taken with SPTrans' consent, even though it is public information and they are legally obliged to provide it. The site and the scraper were created as an act of protest against that, before the specific freedom of digital information law passed (even though previous legislation already regulated the availability of public service information, so no illegal act was committed in this, nor will one be if you use it responsibly).
For that reason (and due to the fact that the back end is a bit... "challenged"), the scraper is very careful to throttle its requests in order to avoid overloading their servers. This makes the crawl take several hours, but you don't want to overload the service (which may force them into blocking you, or even changing the site to make crawling harder).
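To illustrate the throttling idea (this is not how sptscraper is implemented), here is a minimal sketch applied to the URL pattern from the question. The two-second delay, the function names and the assumption that a missing key answers with HTTP 404 are all mine.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://www.cruzalinhas.com/linha.json?key={key}"


def fetch(key, timeout=30):
    """Return the response body for one key, or None on HTTP 404 (assumed)."""
    try:
        with urllib.request.urlopen(BASE_URL.format(key=key), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def crawl(keys, delay_seconds=2.0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Fetch each key in turn, pausing between requests to spare the server."""
    results = {}
    for index, key in enumerate(keys):
        if index > 0:
            sleep_fn(delay_seconds)  # throttle: never send back-to-back requests
        body = fetch_fn(key)
        if body is not None:
            results[key] = body
    return results
```

Passing fetch_fn and sleep_fn as parameters is only there so the loop can be exercised without touching the network.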
I'll eventually do a full rewrite of that code (it was likely my first Python/App Engine code, written a few years ago, and a quick hack focused on exposing how useful this public data can be outside the confines of SPTrans' website). It will have a saner crawling process, should make the latest data available for download in a single click, and will likely make a full list of lines available on the API.
For now, if you just want the latest crawl (which I did a month or two ago), just contact me and I'll be happy to send you the sqlite/JSON files.

Related

Design Question: Best 'place' to parse Scrapy text in Items?

This question is specifically about where, from an architecture/design perspective, is the best place to parse text obtained from a response object in Scrapy.
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site
I've gotten all the data points I need, and have them stored in a local database (sqlite)
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and thus is stored in my database? Or do I just collect the scraped responses 'raw', store directly in my DB and do data cleaning in SQL after the fact, or even via a separate python script?
As mentioned, I'm new to programming as a whole, so I'm not sure what's best from a 'design' perspective.
If you are doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data. So figuring out what those fields are before you write your scraper comes first.
I don't think you need your own middleware unless your data particularly needs work done at the level of requests and responses. Middlewares are mostly useful for processing requests and responses rather than for data manipulation/cleaning, i.e. if you have duplicates, or need to change responses or add requests, etc.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called ItemAdapter, which is essentially a wrapper for different ways of storing data.
There are also ways to clean data in small ways and in larger ways within Scrapy. You can use ItemLoaders, which put your items through small functions that can manipulate data, or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
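As a rough sketch of the pipeline option, the class below tags each play with a category from keyword patterns. The field names (description, play_category) and the keyword lists are hypothetical, and the sketch assumes the spider yields plain dict items.

```python
import re

# Hypothetical keyword patterns; order matters because the first match wins.
PLAY_PATTERNS = [
    ("Passing Play", re.compile(r"\b(pass|threw|sacked|intercepted)\b", re.I)),
    ("Rushing Play", re.compile(r"\b(ran|rush(ed)?|up the middle)\b", re.I)),
    ("Kicking Play", re.compile(r"\b(punts?|kicks?|field goal)\b", re.I)),
]


def categorize_play(description):
    """Return the first category whose pattern matches, or 'Other'."""
    for category, pattern in PLAY_PATTERNS:
        if pattern.search(description):
            return category
    return "Other"


class PlayCategoryPipeline:
    """Item pipeline that adds a category derived from the play description."""

    def process_item(self, item, spider):
        # Assumes dict items with a 'description' field (hypothetical name).
        item["play_category"] = categorize_play(item.get("description", ""))
        return item
```

A pipeline like this has to be enabled through the ITEM_PIPELINES setting. Keeping categorize_play as a plain function means the same logic can also be run later from a separate script against the raw text already in the database.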
You will have to think about database design and what tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy. The database pipeline is flexible enough for you to place data into any table you want using SQL queries.
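A database pipeline along those lines might look like the sketch below. The table name, the columns and the default file name are assumptions; the open_spider, process_item and close_spider hooks are the ones Scrapy calls on a pipeline.

```python
import sqlite3


class SQLitePipeline:
    """Item pipeline that writes each play into a SQLite table."""

    def __init__(self, db_path="plays.sqlite"):  # hypothetical file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and ensure the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS plays ("
            " game_id TEXT, description TEXT, play_category TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized insert; the item is assumed to be dict-like.
        self.conn.execute(
            "INSERT INTO plays (game_id, description, play_category)"
            " VALUES (?, ?, ?)",
            (item.get("game_id"), item.get("description"), item.get("play_category")),
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: persist and release the DB.
        self.conn.commit()
        self.conn.close()
```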
Familiarising yourself with the Scrapy architecture overview in the official documentation might help you create a mental model of the process.

Flask website backend structure guidance assistance?

I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave height, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
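Here is one possible shape for that two-table schema, as a sketch only. It uses SQLite from the Python standard library as a stand-in for Postgres or MySQL, and the column names and units are guesses at what the scraper produces.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS surf_breaks (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS observations (
    id            INTEGER PRIMARY KEY,
    surf_break_id INTEGER NOT NULL REFERENCES surf_breaks(id),
    observed_at   TEXT NOT NULL,  -- ISO 8601 timestamp
    wave_height_m REAL,
    wind_kph      REAL,
    tide_m        REAL
);
"""


def create_schema(conn):
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def latest_observation(conn, break_name):
    """Return the most recent observation row for a break, or None."""
    return conn.execute(
        "SELECT o.observed_at, o.wave_height_m, o.wind_kph, o.tide_m"
        " FROM observations AS o"
        " JOIN surf_breaks AS b ON b.id = o.surf_break_id"
        " WHERE b.name = ?"
        " ORDER BY o.observed_at DESC LIMIT 1",
        (break_name,),
    ).fetchone()
```

If a query such as latest_observation feels awkward to write, that is the hint mentioned above that the schema needs another look.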
Next, try to insert data to the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if it is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure it out from the current time.
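One way to get both properties, sketched with SQLite: a UNIQUE constraint plus INSERT OR IGNORE makes re-runs harmless, and a start/end interval keeps each run small. The table layout and the (surf_break_id, observed_at, wave_height_m) row shape are assumptions. Postgres and MySQL have their own equivalents of INSERT OR IGNORE.

```python
import sqlite3


def ensure_table(conn):
    """Create the observations table with a uniqueness rule per break and time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        " surf_break_id INTEGER NOT NULL,"
        " observed_at TEXT NOT NULL,"  # ISO 8601, so string order is time order
        " wave_height_m REAL,"
        " UNIQUE (surf_break_id, observed_at))"
    )


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]


def import_observations(conn, rows, start, end):
    """Insert rows with start <= observed_at < end; return how many were new."""
    before = count_rows(conn)
    with conn:  # one transaction per import run
        conn.executemany(
            "INSERT OR IGNORE INTO observations"
            " (surf_break_id, observed_at, wave_height_m) VALUES (?, ?, ?)",
            [row for row in rows if start <= row[1] < end],
        )
    return count_rows(conn) - before
```

Running the same interval twice inserts nothing the second time, which is what makes it safe to re-run after a failure.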
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use the Linux subsystem cron to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
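A data-access function using a parameterized query could look like this minimal sketch. It uses the standard-library sqlite3 module, and the table and column names are assumptions.

```python
import sqlite3


def observations_for_break(conn, surf_break_id):
    """Data-access helper: all observations for one break, oldest first."""
    # The ? placeholder makes the driver pass the value separately from the
    # SQL text, so user-supplied input is never interpreted as SQL.
    cursor = conn.execute(
        "SELECT observed_at, wave_height_m FROM observations"
        " WHERE surf_break_id = ? ORDER BY observed_at",
        (surf_break_id,),
    )
    return cursor.fetchall()
```

The thing to avoid is building the query with string formatting, for example "... WHERE surf_break_id = " + user_input. The placeholder syntax varies by driver: sqlite3 uses ?, while psycopg2 and the common MySQL drivers use %s.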
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
For performance, try not to get in a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app causes its content to change. Simply invalidate the cache when you import new data.

django cms copy translated pages from dev serv to prod serv

Django CMS can be as buggy as it is powerful sometimes.
My company just hired some translators, and they worked on the development server.
They translated a lot of pages from English to Turkish, Spanish, French...
That's why I've been asked to find a way to copy those pages to the production server.
I am not completely at ease with back-end stuff yet, and after reading this thread:
copy pages from dev to prod
I feel even less comfortable, haha.
Isn't there a way to 'easily' copy the pages while excluding one language (the main one, actually: English)?
Thank you in advance for the time spent on my request.
Essentially what you need is in the database, so some kind of export from dev & import to production is what you'll need.
Usually when I'm preparing a site in dev for production release I'll do a full export, sanitize any data, remove anything not needed then import to production. It's easier doing things that way, especially dealing with multiple languages because of the way CMS divides up content for pages between multiple tables.
For example, all the page settings are kept in cms_title, which links back to the cms_page table, where there's a copy of each page per language. Because you've got a page per language, you'll usually find that plugins store everything in the same table, so the text plugin has its djangocms_text_ckeditor_text table, which stores all your translated content in one place.
You might be best exporting all the content tables you need from the CMS & plugins, like cms_cmsplugin, cms_page, cms_title, djangocms_text_ckeditor_text, etc. Then import them to a local database to test and/or modify any content before it ends up on your production server.

Capturing information for customers such as referral URL and conversion

I was hoping to create my own in-house analytics so I can tell my customers how many visits their company page got on my site and which URL they came from. I am coding this in Python (Flask) and I wondered if anyone could tell me what is the standard, or sensible approach to this problem.
I think it might be to have some sort of Redis queue which is triggered when a visitor comes and then this information is added to the database later so the site doesn't seem slow.
The standard, and sensible, approach is to use Google Analytics. If you must roll your own, you have two options. The first is JavaScript that is executed on every page (like GA) and sends this kind of info to a DB. The second is parsing log files on the server; AWStats is a good bet for that.
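If you go the log-parsing route by hand instead of with a ready-made tool, a rough sketch could look like the following. It assumes the server writes the Apache/Nginx "combined" log format, and the /company/ path prefix is a made-up example of where the customer pages live.

```python
import re
from collections import Counter

# "Combined" log format: ip ident user [time] "request" status size "referrer" "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "[^"]*"'
)


def referrers_by_page(lines, path_prefix="/company/"):
    """Count (path, referrer) pairs for successful hits under path_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or not match["status"].startswith("2"):
            continue  # skip unparseable lines and non-2xx responses
        if match["path"].startswith(path_prefix):
            counts[(match["path"], match["referrer"])] += 1
    return counts
```

Because this runs offline against the log file, it adds no work to the request itself, which addresses the worry about the site seeming slow.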

Efficient storage of and access to web pages with Python

So like many people I want a way to download, index/extract information and store web pages efficiently. My first thought is to use MySQL and simply shove the pages in which would let me use FULLTEXT searches which would let me do ad hoc queries easily (in case I want to see if something exists and extract it/etc.). But of course performance wise I have some concerns especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB/search engines/etc. So to summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store meta data (URL, time retrieved, any GET/POST stuff I sent), response code, etc. of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible, I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss meta data into MySQL, store the data on the file system or something like CouchDB, write some custom search stuff.
Toss meta data into MySQL, store the data on a file system with a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html/etc specified (directory index each directory in other words) and use some search engine like Lucene or Sphinx index the content and use that to search. Biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering please include links to any technologies you mention and if possible what programming languages it has libraries for (i.e. if it's Scala only or whatever it's probably not that useful since this is a Python project). If this question has already been asked (I'm sure it must have been) please let me know (I searched, no luck).
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine that MySQL's wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming &c). Your metadata all comes from headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- html will come to you in all sorts of weird encodings, but there's no need to get crazy supporting that at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
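The transcode-on-arrival step could be sketched like this. The function name and the fallback order are my own choices, and declared_encoding stands for whatever charset the HTTP headers or the page itself announced.

```python
def to_utf8(raw, declared_encoding=None):
    """Return the page body re-encoded as UTF-8 bytes."""
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or bytes that don't fit: try the next
    # Last resort: latin-1 maps every byte to a character, so this never fails,
    # though the text may be wrong if the real encoding was something else.
    return raw.decode("latin-1").encode("utf-8")
```

Storing only the UTF-8 version keeps search simple. If you also need the page exactly as the server sent it, keep the raw bytes alongside.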
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
You may be trying to achieve too much with the storage of the html (and supporting files). It seems you wish this repository would both
allow displaying a particular page as it was on its original site
provide indexing for locating pages relevant to particular search criteria
The html underlying a web page once looked a bit akin to a self-standing document, but the pages crawled off the net nowadays are much messier: javascript, ajax snippets, advertisement sections, image blocks etc.
This reality may cause you to rethink the one storage for all html approach. (And also the parsing / pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. By "true text content", I mean [possibly partially marked-up] text from the web pages that is otherwise free of all other "Web 2.0 noise". Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres, in the form of semi-structured data. For operational purposes (e.g. to task the crawlers etc.), you may keep a relational store with management-related metadata, but the idea is that for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
It sounds to me like you need a content management system. Check out Plone. If that's not what you want maybe a web framework, like Grok, BFG, Django, Turbogears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)
