SPARQL query on multiple RDF files - python

I have some basics of programming, but I am completely new to RDF or Sparql, so I hope to be clear in what follows.
I am trying to download some data available at http://data.camera.it/data/en/datasets/, and all the data are organized in rdf-xml format, in an ontology.
I noticed this website has a SPARQL Query Editor online (http://dati.camera.it/sparql), and using some of their examples I was able to retrieve and convert some of the data I need using Python. I used the following code and query, using SparqlWrapper
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dati.camera.it/sparql")
sparql.setQuery(
'''
SELECT distinct ?deputatoId ?cognome ?nome ?data ?argomento titoloSeduta ?testo
WHERE {
?dibattito a ocd:dibattito; ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_17>.
?dibattito ocd:rif_discussione ?discussione.
?discussione ocd:rif_seduta ?seduta.
?seduta dc:date ?data; dc:title ?titoloSeduta.
?seduta ocd:rif_assemblea ?assemblea.
?discussione rdfs:label ?argomento.
?discussione ocd:rif_intervento ?intervento.
?intervento ocd:rif_deputato ?deputatoId; dc:relation ?testo.
?deputatoId foaf:firstName ?nome; foaf:surname ?cognome .
}
ORDER BY ?data ?cognome ?nome
LIMIT 100
'''
)
sparql.setReturnFormat(JSON)
results_raw = sparql.query().convert()
However, I have a problem because the website allows only to download 10,000 values. As far as I understood, this limit cannot be modified.
Therefore I decided to download the datasets on my computer. I tried to work on all these rdf files, but I don't know how to do it, since, as far as I know, the SparqlWrapper does not work with local files.
So my questions are:
How do I create a dataset containing all the RDF files so that I can work on them as if it were a single object?
How do I query on such an object to retrieve the information I need? Is that possible?
Is this way of reasoning the right approach?
Any suggestion on how to tackle the problem is appreciated.
Thank you!

Download all the RDF/XML files from their download area, and load them into a local instance of Virtuoso (which happens to be the engine they are using for their public SPARQL endpoint). You will have the advantage of running a much more recent version (v7.2.5.1 or later), whether Open Source or Enterprise Edition than the one they've got (Open Source v7.1.0, from March, 2014!).
Use your new local SPARQL endpoint, found at http://localhost:8890/sparql by default. You can configure it to have no limits on result set sizes, or query runtimes, or otherwise.
Seems likely.
(P.S. You might encourage the folks at dati.camera.it (assistenza-dati#camera.it) to upgrade their Virtuoso instance. There are substantial performance and feature enhancements awaiting!)

Related

Get the data from a REST API and store it in HDFS/HBase

I'm new to Big data. I learned that HDFS is for storing more of structured data and HBase is for storing unstructured data. I'm having a REST API where I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into? HDFS or HBase? Also can you please direct me to some tutorial to do this. I came across this about Tutorial with Streaming Data. But I'm not sure if this will fit my use case.
It would be of great help if you can guide me to a particular resource/ technology to solve this issue.
There is several questions you have to think about
Do you want to work with batch files or streaming ? It depends on the rate at which your REST API will be requested
For the Storage there is not just HDFS and Hbase, you have a lot of other solutions as Casandra, MongoDB, Neo4j. All depends on the way you want to use it (Random Acces VS Full Scan, Update with versioning VS writing new lines, Concurrency access). For example Hbase is good for random access, Neo4j for graph storage,... If you are receiving JSON files, MongoDB can be a god choice as it stores object as document.
What is the size of your data ?
Here is good article on questions to think about when you start a big data project documentation

Flask website backend structure guidance assistance?

I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave heigh, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
Next, try to insert data to the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if this is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure out out from the current time.
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use the Linux subsystem cron to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
For performance, try not to get in a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app cause its content to change. Simply invalidate the cache when you import new data.

Insert a row in to google BigQuery table from the values of a python list

I am a newbie who is exploring Google BigQuery. I would like to insert a row into the BigQuery table from a python list which contains the row values.
To be more specific my list looks like this: [100.453, 108.75, 50.7773]
I found a couple of hints from BigQuery-Python library insert
and also looked in to pandas bigquery writer but not sure whether they are perfect for my usecase.
What would be the better solution?
Thanks in advance.
Lot's of resources but I usually find code examples to be the most informative for beginning.
Here's an excellent collection of bigquery python code samples: https://github.com/googleapis/python-bigquery/tree/master/samples.
One straight forward way to insert rows:
from google.cloud import bigquery
bq_client = bigquery.Client()
table = bq_client.get_table("{}.{}.{}".format(PROJECT, DATASET, TABLE))
rows_to_insert = [{u"COL1": 100.453, u"COL2": 108.75, u"COL3": 50.7773}, {u"COL1": 200.348, u"COL2": 208.29, u"COL3": 60.7773}]
errors = bq_client.insert_rows_json(table, rows_to_insert)
if errors == []:
print("success")
Lastly to verify if it's inserted successfully use:
bq query --nouse_legacy_sql 'SELECT * FROM `PROJECT.DATASET.TABLE`'
Hope that helps everyone!
To work with Google Cloud Platform services using Python, I would recommend using python google-cloud and for BigQuery specifically the submodule google-cloud-bigquery(this was also recommended by #polleyg. This is an open-source Python idiomatic client maintained by the Google. This will allow you to easily use all the google cloud services in a simple and consistent way.
More specifically, the example under Insert rows into a table’s data in the documentation shows how to insert Python tuples/lists into a BigQuery table.
However depending on your needs, you might need other options, my ordering of options:
If the you use code that has a native interface with Google Services (e.g. BigQuery) and this suits your needs, use this. In your case test if Pandas-BigQuery works for you.
If your current code/modules don't have a native interface, try the Google maintained idiomatic client google-cloud.
If that doesn't suit your needs, use an external idiomatic client like tylertreat/BigQuery-Python. The problem is that you will have different inconsistent clients for the different services. The benefit can be that it adds some functionalities not provided in the google-cloud module.
Finally, if you work with very new alpha/beta features, use the APIs directly with the Google API module, this is will always give you access to the latest APIs, but is a bit harder to work with. So only use this if the previous options don't give you what you need.
The Google BigQuery docs show you how:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery_stream_data_python
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/stream_data.py

Loading Dbpedia data and ontologies into virtuoso locally

I am trying to set up dbpedia sparql endpoint locally with the help of virtuoso triple store.
I followed two links.
Loading data with the help of folders
Loading data with the help of symbolic links
from these links. I followed the configurations according to the second link and I tried to load the data only from "en" folder and dbpedia-owl.owl file into "en" folder itself. I tried to load this en folder with the following command onto isql
ld_dir_all('/media/D8849AB0849A911C/datasets/en','*','http://dbpedia.org');
I did the further processing for committing this data. Then checked the data onto the local endpoint "localhost:8890/sparql". But the prefix "dbpedia-owl" seems to be missing. I also checked into the list of "namespace prefixes". but "dbpedia-owl" seems to be missing. What did i do wrong while loading the data? Also I tried to add dbpedia-owl.gz file too. But the "dbpedia-owl" still doesnt seems to work on the endpoint.
when i tried to query this
select ?type {
?type a owl:Class .
} LIMIT 5
I got the result as
type
http://www.w3.org/2002/07/owl#Thing
http://www.w3.org/2002/07/owl#Nothing
http://dbpedia.org/ontology/Abbey
http://dbpedia.org/ontology/Abbey
http://dbpedia.org/ontology/AcademicJournal
So this result shows the data from ontology file. But the "dbpedia-owl" is not getting linked to this ontology file. Help is appreciated.
This is a very late answer, but I stumbled across this question...
As far as I know, you loaded the ontology into virtuoso (so the classes and properties definitions are available in the DB), but this is different from defining a prefix and associating it with an URL.
If you want to do the later programmatically, just use :
DB.DBA.XML_SET_NS_DECL ('dbpedia-owl', 'http://dbpedia.org/ontology/', 2);
This just tell to virtuoso that, locally, the dbpedia-owl prefix will be used to denote the dbpedia ontology URL. There is now such thing as an universal prefix, hence you may also want to use any other prefix like dbpo or whatever you see fit on your local virtuoso server.

error in retrieving tables in unicode data using Azure/Python

I'm using Azure and the python SDK.
I'm using Azure's table service API for DB interaction.
I've created a table which contains data in unicode (hebrew for example). Creating tables and setting the data in unicode seems to work fine. I'm able to view the data in the database using Azure Storage Explorer and the data is correct.
The problem is when retrieving the data.. Whenever I retrieve specific row, data retrieval works fine for unicoded data:
table_service.get_entity("some_table", "partition_key", "row_key")
However, when trying to get a number of records using a filter, an encode exception is thrown for any row that has non-ascii chars in it:
tasks = table_service.query_entities('some_table', "PartitionKey eq 'partition_key'")
Is this a bug on the azure python SDK? Is there a way to set the encoding beforehand so that it won't crash? (azure doesn't give access to sys.setdefaultencoding and using DEFAULT_CHARSET on settings.py doesn't work as well)
I'm using https://www.windowsazure.com/en-us/develop/python/how-to-guides/table-service/ as reference to the table service API
Any idea would be greatly appreciated.
This looks like a bug in the Python library to me. I whipped up a quick fix and submitted a pull request on GitHub: https://github.com/WindowsAzure/azure-sdk-for-python/pull/59.
As a workaround for now, feel free to clone my repo (remembering to checkout the dev branch) and install it via pip install <path-to-repo>/src.
Caveat: I haven't tested my fix very thoroughly, so you may want to wait for the Microsoft folks to take a look at it.

Categories