I'm a Ruby dev doing a lot of data work who's decided to switch to Python. I'm enjoying making the transition so far and have been blown away by Pandas, Jupyter notebooks, etc.
My current task is to write a lightweight RESTful API that under the hood is running queries against Google BigQuery.
I have a really simple test running in Flask, this works fine, but I did have trouble rendering the BigQuery response as JSON. To get around this, I used Pandas and then converted the dataframe to JSON. While it works, this feels like an unnecessary step and I'm not even sure this is a legitimate use case for Pandas. I have also read that creating a dataframe can be slow as data volume increases.
Below is my little mock-up test in Flask. It would be really helpful to hear from experienced Python devs how you'd approach this, and whether there are any other libraries I should be looking at here.
from flask import Flask
from google.cloud import bigquery
import pandas

app = Flask(__name__)

@app.route("/bq_test")
def bq_test():
    # Run the query, pull the result into a dataframe, then serialize it to JSON
    client = bigquery.Client.from_service_account_json('/my_creds.json')
    sql = """select * from `my_dataset.my_table` limit 1000"""
    query_job = client.query(sql).to_dataframe()
    return query_job.to_json(orient="records")

if __name__ == "__main__":
    app.run()
From the BigQuery documentation:
BigQuery supports functions that help you retrieve data stored in
JSON-formatted strings and functions that help you transform data into
JSON-formatted strings:
JSON_EXTRACT or JSON_EXTRACT_SCALAR
JSON_EXTRACT(json_string_expr, json_path_string_literal), which returns JSON values as STRINGs.
JSON_EXTRACT_SCALAR(json_string_expr, json_path_string_literal), which returns scalar JSON values as STRINGs.
Description
The json_string_expr parameter must be a JSON-formatted string. ...
The json_path_string_literal parameter identifies the value or values you want to obtain from the JSON-formatted string. You construct this parameter using the JSONPath format.
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
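To tie that back to the Flask question above, here is a minimal sketch (my addition, not the original poster's code) of how the pandas step could be skipped by letting BigQuery serialize each row itself with TO_JSON_STRING, one of the functions on that docs page that transform data into JSON-formatted strings. The credentials path and table name are the placeholders from the question.
from flask import Flask, Response
from google.cloud import bigquery

app = Flask(__name__)

@app.route("/bq_test_json")
def bq_test_json():
    client = bigquery.Client.from_service_account_json("/my_creds.json")
    sql = """
        SELECT TO_JSON_STRING(t) AS row_json
        FROM `my_dataset.my_table` AS t
        LIMIT 1000
    """
    rows = client.query(sql).result()  # iterator of single-column rows
    # Each row already holds one JSON object as a string; just stitch them into an array
    body = "[" + ",".join(row["row_json"] for row in rows) + "]"
    return Response(body, mimetype="application/json")
Each result row then carries one serialized object, so the route only concatenates strings and never builds a dataframe.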
Related
I'm trying to piece together the code required to run a query on a Hive/HDFS database (i.e. the same query I could run in Hive or Impala, using Zeppelin or Hue), then upload the contents of that to a REST API URL. I'm a very experienced developer but new to Python, dataframes, Spark, HDFS etc.
I've got my SQL query that returns the correct data (e.g. using Impala or Hive).
I've got Python code that will connect to a REST API endpoint for upload:
import requests
x = requests.post(url, data = my_data)
I know that the Python pandas library can save out CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv
I'm not sure how to get Python to run the query though, and what else I might be missing here...
The execution environment is Python or PySpark running in Apache Zeppelin; the table is in Hadoop/HDFS.
Apologies if I'm misusing terms here, just trying to get my head around this :)
Thanks
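Not a definitive answer, but a rough sketch of one way to glue those pieces together in a Zeppelin %pyspark paragraph. It assumes the built-in spark session can see the Hive table and that the REST endpoint accepts CSV in the request body; the query, table, and URL below are placeholders.
import requests

# Your Hive/Impala SQL, unchanged
query = "SELECT * FROM my_db.my_table WHERE some_col = 'x'"

# `spark` is the session Zeppelin provides in a %pyspark paragraph (assumption)
spark_df = spark.sql(query)               # run the query via Spark's Hive support
pandas_df = spark_df.toPandas()           # collect to the driver; fine for modest result sets
csv_body = pandas_df.to_csv(index=False)  # to_csv with no path returns a string

url = "https://example.com/upload"        # placeholder endpoint
response = requests.post(url, data=csv_body,
                         headers={"Content-Type": "text/csv"})
response.raise_for_status()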
I am trying to query Google BigQuery using the Pandas/Python client interface. I am following the tutorial here: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas. I was able to get it to work but I want to query the data as the JSON format that can be downloaded directly from the WebUI (see screenshot). Is there a way to download data as the JSON structure pictured instead of converting it to the data frame object?
I imagine the command would be somewhere around this part of the code from the tutorial:
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
Just add a .to_json(orient='records') call after converting to the dataframe:
json_data = bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient).to_json(orient='records')
pandas docs
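As a small usage sketch (my addition): the resulting string has the same "records" shape the Web UI exports, i.e. one JSON object per row, so it can be parsed straight back into Python objects if needed.
import json

records = json.loads(json_data)  # json_data is the string built above
if records:
    print(records[0])            # first row as a plain dict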
Disclosure: I am not a developer or anything and just had to do this because, well, I had to do it. Of course, I was super proud when I coded hangman in Python, but that was pretty much it.
So I had to push data from one service into a MySQL table, connecting to the service through their aggregation API. To my surprise everything works as expected, BUT there are two problems:
The script is super slow. It takes around 500-700 seconds to execute.
It works when I run it manually, but it times out on the scheduler.
So my question to you, fellow community: could you hint at what I should read, or maybe change, to make it at least a little bit faster?
As business background, I have to run separate queries for 10 different languages, but in the code below I provide only one language and put descriptions around it.
The timeout on scheduled execution happens somewhere between the 5th and 6th language.
# used modules
import requests
import json
import pandas as pd
import MySQLdb
url = 'here comes URI to service API aggregation call'
headers = {'Integration-Key':'Key','Content-Type' : 'application/json'}
# the next one is a different request body for each of the 10 languages, so 10 such variables (only English shown here)
data_en = '''{Here comes a long long JSON request so API can aggregate it all }'''
# requesting data from API
# Again, 10 times for the next block
response = requests.post(url, headers=headers, data=data_en)
json_data = json.loads(response.text)
df_en = pd.DataFrame(json_data['results'])
# So on schedule, it time outs after 5th or 6th language
# creating merged table
df = pd.concat([df_en,df_sv,and_so_on],ignore_index=True)
db=MySQLdb.connect(host="host", user="user",passwd="pws",db="db")
df.to_sql(con=db, name='nps', if_exists='replace', flavor='mysql')
I have never found to_sql to work well for datasets that are at all large. I recommend turning your dataframe into a CSV and then using your database's bulk-load path to get it into the table (COPY via psycopg2 if you're on Postgres; the MySQL equivalent is LOAD DATA LOCAL INFILE).
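For the MySQL case in the question, a minimal sketch of that CSV-plus-bulk-load idea might look like the following. It assumes the target table's columns match the dataframe and that the server permits LOAD DATA LOCAL INFILE (the local_infile flag is an assumption about your setup); the file path and connection details are placeholders.
import MySQLdb
import pandas as pd

def bulk_load(df, table_name="nps"):
    # Dump the dataframe to a temp CSV (data rows only, no header)
    csv_path = "/tmp/nps_upload.csv"
    df.to_csv(csv_path, index=False, header=False)

    # local_infile=1 is an assumption about your server/client configuration
    db = MySQLdb.connect(host="host", user="user", passwd="pws",
                         db="db", local_infile=1)
    try:
        cur = db.cursor()
        cur.execute(
            "LOAD DATA LOCAL INFILE '{}' INTO TABLE {} "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'".format(csv_path, table_name)
        )
        db.commit()
    finally:
        db.close()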
I am a newbie who is exploring Google BigQuery. I would like to insert a row into the BigQuery table from a python list which contains the row values.
To be more specific my list looks like this: [100.453, 108.75, 50.7773]
I found a couple of hints in the BigQuery-Python library's insert examples,
and also looked into the pandas BigQuery writer, but I'm not sure whether they are a good fit for my use case.
What would be the better solution?
Thanks in advance.
Lots of resources exist, but I usually find code examples to be the most informative when starting out.
Here's an excellent collection of BigQuery Python code samples: https://github.com/googleapis/python-bigquery/tree/master/samples.
One straightforward way to insert rows:
from google.cloud import bigquery

bq_client = bigquery.Client()
table = bq_client.get_table("{}.{}.{}".format(PROJECT, DATASET, TABLE))
rows_to_insert = [
    {u"COL1": 100.453, u"COL2": 108.75, u"COL3": 50.7773},
    {u"COL1": 200.348, u"COL2": 208.29, u"COL3": 60.7773},
]
errors = bq_client.insert_rows_json(table, rows_to_insert)
if errors == []:
    print("success")
Lastly, to verify that the rows were inserted successfully, use:
bq query --nouse_legacy_sql 'SELECT * FROM `PROJECT.DATASET.TABLE`'
Hope that helps everyone!
To work with Google Cloud Platform services from Python, I would recommend the google-cloud Python packages, and for BigQuery specifically the submodule google-cloud-bigquery (this was also recommended by @polleyg). This is an open-source, idiomatic Python client maintained by Google. It lets you use all the Google Cloud services in a simple and consistent way.
More specifically, the example under Insert rows into a table’s data in the documentation shows how to insert Python tuples/lists into a BigQuery table.
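A minimal sketch of that documented pattern (my paraphrase, not the docs' verbatim sample): insert_rows accepts tuples or lists whose order matches the table schema fetched with get_table. The project, dataset, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # placeholder IDs
rows = [(100.453, 108.75, 50.7773)]       # one tuple per row, in schema column order
errors = client.insert_rows(table, rows)  # streaming insert
assert errors == [], errors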
However, depending on your needs, you might want other options. My ordering of options:
If you already use code that has a native interface to Google services (e.g. BigQuery) and this suits your needs, use that. In your case, test whether the pandas BigQuery interface works for you.
If your current code/modules don't have a native interface, try the Google-maintained idiomatic client, google-cloud.
If that doesn't suit your needs, use an external idiomatic client like tylertreat/BigQuery-Python. The drawback is that you will have different, inconsistent clients for the different services. The benefit can be that it adds some functionality not provided in the google-cloud module.
Finally, if you work with very new alpha/beta features, use the APIs directly through the Google API client module. This will always give you access to the latest APIs, but it is a bit harder to work with, so only use it if the previous options don't give you what you need.
The Google BigQuery docs show you how:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery_stream_data_python
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/stream_data.py
I know that in JavaScript there is the stringify command, but is there something like this in Python for Pyramid applications? Right now I have a view callable that takes an uploaded STL file and parses it into a format like this: data = [[[x1, x2, x3], ...], [[v1, v2, v3], ...]]. How can I convert this into a JSON string so that it can be stored in an SQLite database? Can I insert the JavaScript stringify command into my views.py file? Is there an easier way to do this?
You can use the json module to do this:
import json
data_str = json.dumps(data)
There are other array representations that can be stored in a database as well (see pickle).
However, if you're actually constructing a database, you should know that it's considered a violation of basic database principles (first normal form) to store multiple data items in a single value in a relational database. What you should do is decompose the array into rows (and possibly separate tables) and store a single value in each "cell". That will allow you to query and analyze the data using SQL.
If you're not trying to build an actual database (if the array is completely opaque to your application and you'll never want to search, sort, aggregate, or report by the values inside the array) you don't need to worry so much about normal form but you may also find that you don't need the overhead of an SQL database.
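To make the normalization advice concrete, here is a hedged illustration (my example, assuming the nested list is [[vertices...], [normals...]] as in the question; the table and column names are made up):
import sqlite3

data = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],    # vertices (example values)
        [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]]    # normals (example values)

conn = sqlite3.connect("stl.db")
conn.execute("""CREATE TABLE IF NOT EXISTS points
                (kind TEXT, idx INTEGER, x REAL, y REAL, z REAL)""")
# One value per column, one triple per row: now the data is queryable with SQL
for kind, group in zip(("vertex", "normal"), data):
    for idx, (x, y, z) in enumerate(group):
        conn.execute("INSERT INTO points VALUES (?, ?, ?, ?, ?)",
                     (kind, idx, x, y, z))
conn.commit()
conn.close()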
You can also use cjson; it is faster than the standard json library:
import cjson
json_str = cjson.encode(data)  # encode takes the Python object itself, not a pre-built string