I'm trying to upload a local CSV to Google BigQuery using Python:
def uploadCsvToGbq(self, table_name):
    load_config = {
        'destinationTable': {
            'projectId': self.project_id,
            'datasetId': self.dataset_id,
            'tableId': table_name
        }
    }
    load_config['schema'] = {
        'fields': [
            {'name': 'full_name', 'type': 'STRING'},
            {'name': 'age', 'type': 'INTEGER'},
        ]
    }
    load_config['sourceFormat'] = 'CSV'

    upload = MediaFileUpload('sample.csv',
                             mimetype='application/octet-stream',
                             # This enables resumable uploads.
                             resumable=True)

    start = time.time()
    job_id = 'job_%d' % start

    # Create the job.
    result = bigquery.jobs.insert(
        projectId=self.project_id,
        body={
            'jobReference': {
                'jobId': job_id
            },
            'configuration': {
                'load': load_config
            }
        },
        media_body=upload).execute()

    return result
When I run this it throws an error like
"NameError: global name 'MediaFileUpload' is not defined"
Is any extra module needed? Please help.
One of the easiest methods to upload a CSV file to GBQ is through pandas. Just read the CSV file into pandas (pd.read_csv()), then push it from pandas to GBQ (df.to_gbq(full_table_id, project_id=project_id)).
import pandas as pd

df = pd.read_csv('/..localpath/filename.csv')
# full_table_id is 'dataset.table' and project_id is your GCP project ID.
df.to_gbq(full_table_id, project_id=project_id)
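Under the hood to_gbq relies on the pandas-gbq package, and by default it fails if the destination table already exists. A minimal sketch with placeholder IDs (the dataset, table, and project names here are assumptions):

import pandas as pd

df = pd.read_csv('/..localpath/filename.csv')
# 'my_dataset.new_table' and 'my-project' are placeholders for your own IDs.
df.to_gbq('my_dataset.new_table', project_id='my-project', if_exists='append')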
Or you can use the client API:
from google.cloud import bigquery
import pandas as pd
df=pd.read_csv('/..localpath/filename.csv')
client = bigquery.Client()
dataset_ref = client.dataset('my_dataset')
table_ref = dataset_ref.table('new_table')
client.load_table_from_dataframe(df, table_ref).result()
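If you don't need pandas at all, the same client can load the CSV file directly. A minimal sketch, assuming a reasonably recent google-cloud-bigquery, that schema autodetection is acceptable, and placeholder dataset/table/file names:

from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('my_dataset').table('new_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1      # skip the header row
job_config.autodetect = True          # infer the schema from the file

with open('/..localpath/filename.csv', 'rb') as source_file:
    load_job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

load_job.result()  # wait for the load to complete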
pip install --upgrade google-api-python-client
Then at the top of your Python file write:
from googleapiclient.http import MediaFileUpload
But be careful: you are missing some parentheses, because jobs is a method of the service object. Better write:
result = bigquery.jobs().insert(
    projectId=PROJECT_ID,
    body={
        'jobReference': {'jobId': job_id},
        'configuration': {'load': load_config}
    },
    media_body=upload
).execute(num_retries=5)
And by the way, you are going to upload all your CSV rows, including the top one that defines the columns.
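If you want the header row skipped, the load configuration accepts a skipLeadingRows field; for example:

# Skip the first row of the CSV (the column headers).
load_config['skipLeadingRows'] = 1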
The class MediaFileUpload is in http.py. See https://google-api-python-client.googlecode.com/hg/docs/epy/apiclient.http.MediaFileUpload-class.html
Related
I'm using the Wikidata Query Service to obtain values and this is the code:
pip install sparqlwrapper
import sys
from SPARQLWrapper import SPARQLWrapper, JSON
endpoint_url = "https://query.wikidata.org/sparql"
query = """#List of organizations
SELECT ?org ?orgLabel
WHERE
{
?org wdt:P31 wd:Q4830453. #instance of organizations
?org wdt:P17 wd:Q96. #Mexico country
SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}"""
def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    print(result)
This code gives me the data that I need, but I'm having problems trying to export this information with this line:
results.to_csv('results.csv', index=False)
with this error:
'dict' object has no attribute 'to_csv'
I imported pandas and numpy to do it, but I'm still having problems, so I would like to know how to put these results into a format that lets me create my CSV file with the data obtained.
results is a dictionary, i.e. a Python data structure on which you can't invoke a to_csv method.
For safely writing a CSV from a Python dictionary you can use the csv module from the standard library (see also the documentation on python.org).
The specific solution depends on which (meta)data you exactly want to export. In the following I assume that you want to store the value of org and orgLabel.
import csv

bindings = results['results']['bindings']
sparqlVars = ['org', 'orgLabel']
metaAttribute = 'value'

with open('results.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=sparqlVars)
    writer.writeheader()
    for b in bindings:
        writer.writerow({var: b[var][metaAttribute] for var in sparqlVars})
And the output is:
org,orgLabel
http://www.wikidata.org/entity/Q47099,"Grupo Televisa, owner of TelevisaUnivision"
http://www.wikidata.org/entity/Q429380,Aeropuertos y Servicios Auxiliares
http://www.wikidata.org/entity/Q482267,América Móvil
...
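Since you already import pandas, an alternative sketch is to flatten the bindings into a DataFrame and let to_csv do the writing (this assumes you only want the plain value field of each binding):

import pandas as pd

# Keep only the plain string values of each binding.
rows = [{var: b[var]['value'] for var in b} for b in results['results']['bindings']]
pd.DataFrame(rows).to_csv('results.csv', index=False)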
As a committer of
https://github.com/WolfgangFahl/pyLoDStorage
I am going to point out that the SPARQL class of pyLodStorage is explicitly there to make conversion to other formats simple.
pip install pyLodStorage
sparqlquery --query 'SELECT ?org ?orgLabel
WHERE
{
?org wdt:P31 wd:Q4830453. #instance of organizations
?org wdt:P17 wd:Q96. #Mexico country
SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}' --format csv
result:
"org","orgLabel"
"http://www.wikidata.org/entity/Q47099","Grupo Televisa, owner of TelevisaUnivision"
"http://www.wikidata.org/entity/Q482267","América Móvil"
"http://www.wikidata.org/entity/Q515411","Q515411"
"http://www.wikidata.org/entity/Q521673","Grupo Modelo"
Of course you can get the same result directly via the Python API:
from lodstorage.sparql import SPARQL
from lodstorage.csv import CSV
sparqlQuery="""SELECT ?org ?orgLabel
WHERE
{
?org wdt:P31 wd:Q4830453. #instance of organizations
?org wdt:P17 wd:Q96. #Mexico country
SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}"""
sparql=SPARQL("https://query.wikidata.org/sparql")
qlod=sparql.queryAsListOfDicts(sparqlQuery)
csv=CSV.toCSV(qlod)
print(csv)
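If you want the result in a file rather than on stdout, you can write the returned string yourself (assuming toCSV returns a plain string, as the print above suggests; the filename is just an example):

# Write the CSV string returned by CSV.toCSV to a file.
with open('results.csv', 'w', encoding='utf-8') as f:
    f.write(csv)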
I am writing a script to pull back metrics for ELBv2 (Network LB) using Boto3, but it just keeps returning empty datapoints. I have read through the AWS and Boto docs and scoured here for answers, but nothing seems to be correct. I am aware CloudWatch likes everything to be exact, so I have played with different dimensions, different time windows, datapoint periods, different metrics, with and without specifying units, etc., to no avail.
My script is here:
#!/usr/bin/env python
import boto3
from pprint import pprint
from datetime import datetime
from datetime import timedelta


def initialize_client():
    client = boto3.client(
        'cloudwatch',
        region_name='eu-west-1'
    )
    return client


def request_metric(client):
    response = client.get_metric_statistics(
        Namespace='AWS/NetworkELB',
        Period=300,
        StartTime=datetime.utcnow() - timedelta(days=5),
        EndTime=datetime.utcnow() - timedelta(days=1),
        MetricName='NewFlowCount',
        Statistics=['Sum'],
        Dimensions=[
            {
                'Name': 'LoadBalancer',
                'Value': 'net/nlb-name/1111111111'
            },
            {
                'Name': 'AvailabilityZone',
                'Value': 'eu-west-1a'
            }
        ],
    )
    return response


def main():
    client = initialize_client()
    response = request_metric(client)
    pprint(response['Datapoints'])
    return 0


main()
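One way to debug an empty-datapoints problem like this is to ask CloudWatch which dimension combinations it actually stores for the metric and copy one of them verbatim; a minimal sketch using the same namespace and metric name as above:

import boto3
from pprint import pprint

client = boto3.client('cloudwatch', region_name='eu-west-1')

# List every dimension combination CloudWatch knows for this metric;
# the values must match exactly in get_metric_statistics.
paginator = client.get_paginator('list_metrics')
for page in paginator.paginate(Namespace='AWS/NetworkELB', MetricName='NewFlowCount'):
    for metric in page['Metrics']:
        pprint(metric['Dimensions'])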
I have a problem reading laz files that are stored in IBM Cloud Object Storage. I built the pywren-ibm library with all its requirements (pdal is one of them) with Docker, and I then deployed it to IBM Cloud Functions as an action. The error that appears is "Unable to open stream for 'Colorea.laz'" with error 'No such file or directory.' How can I read the files with pdal in an IBM cloud function?
Here is some of the code:
import pywren_ibm_cloud as pywren
import pdal
import json


def manip_data(bucket, key, data_stream):
    data = data_stream.read()
    cr_json = {
        "pipeline": [
            {
                "type": "readers.las",
                "filename": f"{key}"
            },
            {
                "type": "filters.range",
                "limits": "Classification[9:9]"
            }
        ]
    }
    pipeline = pdal.Pipeline(json.dumps(cr_json, indent=4))
    pipeline.validate()
    pipeline.loglevel = 8
    n_points = pipeline.execute()


bucketname = 'The bucket name'
pw = pywren.ibm_cf_executor(runtime='ammarokran/pywren-pdal:1.0')
pw.map(manip_data, bucketname, chunk_size=None)
print(pw.get_result())
The code is run from a local PC with a Jupyter notebook.
You'll need to specify some credentials and the correct endpoint for the bucket holding the files you're trying to access. Not totally sure how that works with a custom runtime, but typically you can just pass a config object in the executor.
import pywren_ibm_cloud as pywren
config = {'pywren' : {'storage_bucket' : 'BUCKET_NAME'},
'ibm_cf': {'endpoint': 'HOST',
'namespace': 'NAMESPACE',
'api_key': 'API_KEY'},
'ibm_cos': {'endpoint': 'REGION_ENDPOINT',
'api_key': 'API_KEY'}}
pw = pywren.ibm_cf_executor(config=config)
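Separately, readers.las needs a real file path inside the action's container, while manip_data only has the object key and an in-memory stream. One possible sketch (an assumption about how to bridge the two, not a pywren-specific API) is to spill the stream to a temporary file and point the pipeline at that path:

import json
import os
import tempfile

import pdal


def manip_data(bucket, key, data_stream):
    # Spill the object's bytes to a local temp file so PDAL can open it by path.
    suffix = os.path.splitext(key)[1] or '.laz'
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data_stream.read())
        local_path = tmp.name

    cr_json = {
        "pipeline": [
            {"type": "readers.las", "filename": local_path},
            {"type": "filters.range", "limits": "Classification[9:9]"}
        ]
    }
    pipeline = pdal.Pipeline(json.dumps(cr_json))
    pipeline.validate()
    n_points = pipeline.execute()
    os.remove(local_path)
    return n_points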
I have just under 100M records of data that I wish to transform by denormalising a field and then load into a date-partitioned GBQ table. The dates go back to 2001.
I had hoped that I could transform it with Python and then use GBQ directly from the script to accomplish this, but after reading up on this and particularly this document it doesn't seem straightforward to create date-partitioned tables. I'm looking for a steer in the right direction.
Is there any working example of a python script that can do this? Or is it not possible to do via Python? Or is there another method someone can point me in the direction of?
Update
I'm not sure if I've missed something, but the tables created appear to be partitioned by the insert date of when I'm creating the table, and I want to partition by a date set within the existing dataset. I can't see any way of changing this.
Here's what I've been experimenting with:
import uuid
import os
import csv
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
from google.cloud.bigquery import Client
from google.cloud.bigquery import Table
import logging #logging.warning(data_store+file)
import json
import pprint
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to service account credentials'
client = bigquery.Client()
dataset = client.dataset('test_dataset')
dataset.create()
SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]

table = dataset.table('table_name', SCHEMA)
table.partitioning_type = "DAY"
table.create()

rows = [
    ('bob', 30),
    ('bill', 31)
]

table.insert_data(rows)
Is it possible to modify this to take control of the partitions as I create tables and insert data?
Update 2
It turns out I wasn't looking for table partitioning; for my use case it's enough to simply append a date serial to the end of my table name and then query with something along the lines of:
SELECT * FROM `dataset.test_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'
I don't know whether this is technically still partitioning or not, but as far as I can see it has the same benefits.
Updated to the latest version (google-cloud-bigquery==1.4.0):
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]

table = bigquery.Table(table_ref, schema=SCHEMA)

partition = 'DAY'  # only day partitioning is handled here
if partition not in ('DAY',):
    raise NotImplementedError(f"BigQuery partition type unknown: {partition}")
table.time_partitioning = bigquery.table.TimePartitioning(type_=partition)

table = client.create_table(table)  # API request
You can easily create date partitioned tables using the API and Python SDK. Simply set the timePartitioning field to DAY in your script:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/a14905b6931ba3be94adac4d12d59232077b33d2/bigquery/google/cloud/bigquery/table.py#L219
Or roll your own table insert request with the following body:
{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table1",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}
Everything is just backed by the REST api here.
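To address the update about controlling which partition the data lands in: load and streaming requests can target a specific partition by appending a $YYYYMMDD decorator to the table ID. A hedged sketch with placeholder names, assuming a reasonably recent google-cloud-bigquery client that passes the decorator through unchanged:

from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')

# 'table_name$20170701' targets the 2017-07-01 partition of a day-partitioned table.
partition_ref = dataset_ref.table('table_name$20170701')

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1

with open('rows_for_20170701.csv', 'rb') as source_file:
    client.load_table_from_file(source_file, partition_ref, job_config=job_config).result()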
Be aware that different versions of google-api-core handle time-partitioned tables differently. For example, using google-cloud-core==0.29.1, you must use the bigquery.Table object to create time-partitioned tables:
from google.cloud import bigquery
MY_SA_PATH = "/path/to/my/service-account-file.json"
MY_DATASET_NAME = "example"
MY_TABLE_NAME = "my_table"
client = bigquery.Client.from_service_account_json(MY_SA_PATH)
dataset_ref = client.dataset(MY_DATASET_NAME)
table_ref = dataset_ref.table(MY_TABLE_NAME)
actual_table = bigquery.Table(table_ref)
actual_table.partitioning_type = "DAY"
client.create_table(actual_table)
I only discovered this by looking at the 0.20.1 Table source code. I didn't see this in any docs or examples. If you're having problems creating time-partitioned tables, I suggest that you identify the version of each Google library that you're using (for example, using pip freeze), and check your work against the library's source code.
I'm using the BigQuery client by Tyler Treat (https://github.com/tylertreat/BigQuery-Python) through Python. The code compiles without error and returns True for table exists, but data insertion fails. Please let me know if something is wrong in the following code.
from oauth2client.client import flow_from_clientsecrets
from bigquery import get_client
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import sys, os
json_key = 'key.json'
client = get_client(json_key_file=json_key, readonly=True)
exists = client.check_table('Ucare', 'try')
schema = [
    {'name': 'time', 'type': 'STRING', 'mode': 'nullable'}
]

created = client.create_table('Ucare', 'try', schema)
print created
print exists

rows = [{'time': 'ipvbs6k16sp6bkut'}]
#rows = { 'rows':[{'json':{'event_type':'_session.stop'},'insertId' : 0}]}
inserted = client.push_rows('Ucare', 'try', rows, '24556135')
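For reference, a minimal sketch of how push_rows is typically called with that library; rows is a list of plain dicts, and note that a client created with readonly=True requests a read-only scope, which can make inserts fail (the field name below is taken from the schema above):

from bigquery import get_client

# readonly should be left at its default (False) for streaming inserts to work.
client = get_client(json_key_file='key.json', readonly=False)

rows = [{'time': 'ipvbs6k16sp6bkut'}]
inserted = client.push_rows('Ucare', 'try', rows)
print(inserted)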