Upload contents of Spark DataFrame as CSV to REST API using Python?

I'm trying to piece together the code required to run a query on a Hive/HDFS database (i.e. the same query I could run in Hive or Impala, using Zeppelin or Hue), then upload the results of that query as CSV to a REST API endpoint. I'm a very experienced developer but new to Python, dataframes, Spark, HDFS etc.
I've got my SQL query that returns the correct data (e.g. using Impala or Hive).
I've got Python code that will connect to a REST API endpoint for upload:
import requests
x = requests.post(url, data=my_data)
I know that the Python pandas library can write a DataFrame out as CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv
I'm not sure how to get Python to run the query though, and what else I might be missing here...
The execution environment is Python or PySpark running in Apache Zeppelin, and the table is in Hadoop/HDFS.
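For what it's worth, this is roughly the flow I'm imagining, pieced together from the docs (untested; I'm assuming Zeppelin's pyspark interpreter exposes a SparkSession as spark, and the table name and URL below are just placeholders):
import requests

# Run the same SQL I'd use in Hive/Impala, via Spark SQL
sdf = spark.sql("SELECT * FROM my_db.my_table WHERE some_col = 'x'")

# Collect the result to the driver as a pandas DataFrame (fine for modest result sizes),
# then serialise it to CSV text in memory rather than writing a file
csv_payload = sdf.toPandas().to_csv(index=False)

# Upload the CSV text to the REST endpoint
url = "https://example.com/upload"  # placeholder endpoint
x = requests.post(url, data=csv_payload, headers={"Content-Type": "text/csv"})
x.raise_for_status()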
Apologies if I'm misusing terms here, just trying to get my head around this :)
Thanks

Related

Google BigQuery + Python

I need to do an exploratory analysis using Python on two tables that are in a Google BigQuery database.
The only thing I was provided with is a JSON file containing some credentials.
How can I access the data using this JSON file?
This is the first time I've tried to do something like this, so I have no idea how to go about it.
I tried reading various tutorials and documentation, but nothing worked.
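One common pattern, assuming the JSON file is a service-account key, is to use the google-cloud-bigquery client. A minimal sketch (the dataset and table names are placeholders, and the google-cloud-bigquery and pandas packages need to be installed):
# Read BigQuery data into pandas using a service-account JSON key
from google.cloud import bigquery
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("credentials.json")
client = bigquery.Client(credentials=creds, project=creds.project_id)

# Pull a table (or any query result) into a pandas DataFrame for exploration
df = client.query("SELECT * FROM `my_dataset.my_table` LIMIT 1000").to_dataframe()
print(df.head())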

Automatically Importing Data From a Website in Google Cloud

I am trying to find a way to automatically update a BigQuery table using this link: https://www6.sos.state.oh.us/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:1
This link is updated with new data every week, and I want to be able to replace the BigQuery table with this new data. I have read that you can export spreadsheets to BigQuery, but that is not a streamlined approach.
How would I go about submitting a script that imports the data and having that data be fed to Big Query?
I assume you already have a working script that parses the content of the URL and places the contents in BigQuery. Based on that I would recommend the following workflow:
Upload the script as a Google Cloud Function. If your script isn't written in a compatible language (e.g. Python, Node, Go), you can use Google Cloud Run instead. Set the Cloud Function to be triggered by a Pub/Sub message; in this scenario, the content of the Pub/Sub message doesn't matter.
Set up a Google Cloud Scheduler job to (a) run at 12am every Saturday (or whatever time you wish) and (b) send a dummy message to the Pub/Sub topic that your Cloud Function is subscribed to.
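As a rough illustration of the Cloud Function piece, a Pub/Sub-triggered entry point might look something like the sketch below; the table ID is a placeholder and the CSV load options will depend on the actual file:
# Sketch of a Pub/Sub-triggered Cloud Function that refreshes a BigQuery table
import io
import requests
from google.cloud import bigquery

DATA_URL = "https://www6.sos.state.oh.us/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:1"
TABLE_ID = "my_project.my_dataset.voter_file"  # placeholder

def main(event, context):
    """Entry point; the content of the Pub/Sub message is ignored."""
    resp = requests.get(DATA_URL, timeout=300)
    resp.raise_for_status()

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace the table
    )
    client.load_table_from_file(io.BytesIO(resp.content), TABLE_ID, job_config=job_config).result()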
You can try making an HTTP request to the page using a programming language like Python with the Requests library, save the data into a pandas DataFrame or a CSV file, and then use the BigQuery client libraries to push that data into a BigQuery table.
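A sketch of that route (again, the table name is a placeholder and the read_csv options depend on the real file):
# Fetch the CSV into pandas, then push the DataFrame to BigQuery
import pandas as pd
from google.cloud import bigquery

url = "https://www6.sos.state.oh.us/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:1"
df = pd.read_csv(url)  # pandas can read directly from a URL

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")  # replace the table each run
client.load_table_from_dataframe(df, "my_dataset.my_table", job_config=job_config).result()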

SPARQL query on multiple RDF files

I have some basics of programming, but I am completely new to RDF and SPARQL, so I hope what follows is clear.
I am trying to download some data available at http://data.camera.it/data/en/datasets/; all the data are organized in RDF/XML format, in an ontology.
I noticed this website has an online SPARQL query editor (http://dati.camera.it/sparql), and using some of their examples I was able to retrieve and convert some of the data I need using Python. I used the following code and query, using SPARQLWrapper:
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dati.camera.it/sparql")
sparql.setQuery(
'''
SELECT distinct ?deputatoId ?cognome ?nome ?data ?argomento ?titoloSeduta ?testo
WHERE {
?dibattito a ocd:dibattito; ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_17>.
?dibattito ocd:rif_discussione ?discussione.
?discussione ocd:rif_seduta ?seduta.
?seduta dc:date ?data; dc:title ?titoloSeduta.
?seduta ocd:rif_assemblea ?assemblea.
?discussione rdfs:label ?argomento.
?discussione ocd:rif_intervento ?intervento.
?intervento ocd:rif_deputato ?deputatoId; dc:relation ?testo.
?deputatoId foaf:firstName ?nome; foaf:surname ?cognome .
}
ORDER BY ?data ?cognome ?nome
LIMIT 100
'''
)
sparql.setReturnFormat(JSON)
results_raw = sparql.query().convert()
However, I have a problem: the endpoint only allows downloading 10,000 values per query. As far as I understand, this limit cannot be changed.
Therefore I decided to download the datasets to my computer. I tried to work with all these RDF files, but I don't know how to do it, since, as far as I know, SPARQLWrapper does not work with local files.
So my questions are:
How do I create a dataset containing all the RDF files so that I can work on them as if it were a single object?
How do I query on such an object to retrieve the information I need? Is that possible?
Is this way of reasoning the right approach?
Any suggestion on how to tackle the problem is appreciated.
Thank you!
Download all the RDF/XML files from their download area, and load them into a local instance of Virtuoso (which happens to be the engine behind their public SPARQL endpoint). You will have the advantage of running a much more recent version (v7.2.5.1 or later, whether Open Source or Enterprise Edition) than the one they've got (Open Source v7.1.0, from March 2014!).
Use your new local SPARQL endpoint, found at http://localhost:8890/sparql by default. You can configure it to have no limits on result set size, query runtime, or anything else.
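For example, the SPARQLWrapper code from the question should only need the endpoint URL changed; a minimal sketch, assuming a default local Virtuoso install (query_string stands for the same query shown above):
from SPARQLWrapper import SPARQLWrapper, JSON

# Same query as before, just pointed at the local Virtuoso endpoint (default port 8890)
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery(query_string)  # the query string from the question above
sparql.setReturnFormat(JSON)
results_raw = sparql.query().convert()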
Seems likely.
(P.S. You might encourage the folks at dati.camera.it (assistenza-dati#camera.it) to upgrade their Virtuoso instance. There are substantial performance and feature enhancements awaiting!)

Is data pre-processing with Python possible for Klipfolio?

I am not sure this is exactly the right place to ask, but I need any information I can get about this.
I am going to create a dashboard with Klipfolio, and I want to do the data pre-processing with Python and integrate it into Klipfolio, but unfortunately Klipfolio does not have any specific place to do this.
Has anyone used Klipfolio with data pre-processing done in Python?
While Klipfolio does not have any Python integrations, it does connect to various types of SQL databases. One workaround is to dump your processed data from Python into a SQL database and then connect that SQL database to Klipfolio to build data sources for the visualization.
You can either connect directly to the database or, if you are running Python on a server, use Klipfolio's REST/URL connection method to pull the output of your Python code directly into your dashboard.
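A minimal sketch of the first workaround, assuming a MySQL database that Klipfolio can reach (the connection string and table name are placeholders, and a driver such as pymysql must be installed):
# Dump pre-processed data from Python into a SQL table that Klipfolio can use as a data source
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for your real pre-processed data
df = pd.DataFrame({"metric": ["signups", "churn"], "value": [42, 3]})

# Placeholder connection string
engine = create_engine("mysql+pymysql://user:password@host:3306/dashboard_db")
df.to_sql("klipfolio_feed", engine, if_exists="replace", index=False)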

Error retrieving Unicode data from tables using Azure/Python

I'm using Azure and the python SDK.
I'm using Azure's table service API for DB interaction.
I've created a table which contains data in Unicode (Hebrew, for example). Creating tables and setting the data in Unicode seems to work fine. I'm able to view the data in the database using Azure Storage Explorer, and the data is correct.
The problem is when retrieving the data. Whenever I retrieve a specific row, retrieval works fine for Unicode data:
table_service.get_entity("some_table", "partition_key", "row_key")
However, when trying to get a number of records using a filter, an encoding exception is thrown for any row that has non-ASCII characters in it:
tasks = table_service.query_entities('some_table', "PartitionKey eq 'partition_key'")
Is this a bug in the Azure Python SDK? Is there a way to set the encoding beforehand so that it won't crash? (Azure doesn't give access to sys.setdefaultencoding, and using DEFAULT_CHARSET in settings.py doesn't work either.)
I'm using https://www.windowsazure.com/en-us/develop/python/how-to-guides/table-service/ as a reference for the table service API.
Any idea would be greatly appreciated.
This looks like a bug in the Python library to me. I whipped up a quick fix and submitted a pull request on GitHub: https://github.com/WindowsAzure/azure-sdk-for-python/pull/59.
As a workaround for now, feel free to clone my repo (remembering to check out the dev branch) and install it via pip install <path-to-repo>/src.
Caveat: I haven't tested my fix very thoroughly, so you may want to wait for the Microsoft folks to take a look at it.
