Create CSV from the result of a for loop in Google Colab - python

I'm using Wikidata query service to obtain values and this is the code:
pip install sparqlwrapper

import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """#List of organizations
SELECT ?org ?orgLabel
WHERE
{
  ?org wdt:P31 wd:Q4830453. #instance of organizations
  ?org wdt:P17 wd:Q96. #Mexico country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}"""

def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    print(result)
This code gives me the data that I need, but I'm having problems trying to export this information with this line:
results.to_csv('results.csv', index=False)
with this error:
'dict' object has no attribute 'to_csv'
I imported pandas and numpy to do it, but I'm still having problems, so I would like to know how to put these results into a format that lets me create my CSV file with the data obtained.

results is a dictionary, a Python data structure on which you can't invoke a to_csv method.
To write a CSV file from a Python dictionary you can use the csv module from the standard library (see also the documentation on python.org).
The specific solution depends on which (meta)data you exactly want to export. In the following I assume that you want to store the value for org and orgLabel.
import csv

bindings = results['results']['bindings']
sparqlVars = ['org', 'orgLabel']
metaAttribute = 'value'

with open('results.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=sparqlVars)
    writer.writeheader()
    for b in bindings:
        writer.writerow({var: b[var][metaAttribute] for var in sparqlVars})
And the output is:
org,orgLabel
http://www.wikidata.org/entity/Q47099,"Grupo Televisa, owner of TelevisaUnivision"
http://www.wikidata.org/entity/Q429380,Aeropuertos y Servicios Auxiliares
http://www.wikidata.org/entity/Q482267,América Móvil
...
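If you prefer to stay with pandas (which the question already imports), a minimal sketch of the same export, assuming the results dictionary from the question:
import pandas as pd

# Flatten the SPARQL JSON bindings into plain dicts, then let pandas write the CSV
rows = [
    {var: b[var]['value'] for var in ('org', 'orgLabel')}
    for b in results['results']['bindings']
]
pd.DataFrame(rows).to_csv('results.csv', index=False)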

As a committer of
https://github.com/WolfgangFahl/pyLoDStorage
I am going to point out that the SPARQL class of pyLodStorage is explicitly there to make conversion to other formats simple.
pip install pyLodStorage
sparqlquery --query 'SELECT ?org ?orgLabel
WHERE
{
?org wdt:P31 wd:Q4830453. #instance of organizations
?org wdt:P17 wd:Q96. #Mexico country
SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}' --format csv
result:
"org","orgLabel"
"http://www.wikidata.org/entity/Q47099","Grupo Televisa, owner of TelevisaUnivision"
"http://www.wikidata.org/entity/Q482267","América Móvil"
"http://www.wikidata.org/entity/Q515411","Q515411"
"http://www.wikidata.org/entity/Q521673","Grupo Modelo"
Of course you can get the same result directly via the Python API:
from lodstorage.sparql import SPARQL
from lodstorage.csv import CSV

sparqlQuery = """SELECT ?org ?orgLabel
WHERE
{
  ?org wdt:P31 wd:Q4830453. #instance of organizations
  ?org wdt:P17 wd:Q96. #Mexico country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}"""

sparql = SPARQL("https://query.wikidata.org/sparql")
qlod = sparql.queryAsListOfDicts(sparqlQuery)
csv = CSV.toCSV(qlod)
print(csv)
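The toCSV call returns the CSV content as a string, so writing it to a file is a one-liner (a small sketch reusing the csv variable from above):
# Write the CSV string produced by CSV.toCSV to disk
with open('results.csv', 'w', encoding='utf-8') as f:
    f.write(csv)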

Related

Push turtle file in bytes form to stardog database using pystardog

def add_graph(file, file_name):
    file.seek(0)
    file_content = file.read()
    if 'snomed' in file_name:
        conn.add(stardog.content.Raw(file_content, content_type='bytes', content_encoding='utf-8'), graph_uri='sct:900000000000207008')
Here I'm facing issues pushing the file, which I downloaded from an S3 bucket and which is in bytes form. It throws a stardog.Exception with a 500 error when pushing this data to the Stardog database.
I tried pushing the bytes directly as shown below, but that also didn't help.
conn.add(content.File(file), graph_uri='<http://www.example.com/ontology#concept>')
Can someone help me push the Turtle file, which is in bytes form, to the Stardog database using the pystardog Python library?
I believe this is what you are looking for:
import stardog

conn_details = {
    'endpoint': 'http://localhost:5820',
    'username': 'admin',
    'password': 'admin'
}
conn = stardog.Connection('myDb', **conn_details)  # assuming you have this since you already have 'conn'; here it just points at a DB named 'myDb'

file = open('snomed.ttl', 'rb')  # just opening a file as a binary object to mimic your setup
file_name = 'snomed.ttl'         # adding this to keep your function as it is

def add_graph(file, file_name):
    file.seek(0)
    file_content = file.read()  # this will be of type bytes
    if 'snomed' in file_name:
        conn.begin()  # added this to begin a transaction, but I do not think it is required
        conn.add(stardog.content.Raw(file_content, content_type='text/turtle'), graph_uri='sct:900000000000207008')
        conn.commit()  # added this to commit the added data

add_graph(file, file_name)  # I just ran this directly in the Python file for the example.
Take note of the conn.add line where I used text/turtle as the content-type. I added some more context so it can be a running example.
Here is the sample file as well, snomed.ttl:
<http://api.stardog.com/id=1> a :person ;
    <http://api.stardog.com#first_name> "John" ;
    <http://api.stardog.com#id> "1" ;
    <http://api.stardog.com#dob> "1995-01-05" ;
    <http://api.stardog.com#email> "john.doe#example.com" ;
    <http://api.stardog.com#last_name> "Doe" .
EDIT - Query Test
If it runs successfully and there are no errors in stardog.log you should be able to see results using this query. Note that you have to specify the Named Graph since the data was added there. If you query without specifying, it will show no results.
SELECT * {
  GRAPH <sct:900000000000207008> {
    ?s ?p ?o
  }
}
You can run that query in stardog.studio but if you want it in Python, this will print the JSON result:
print(conn.select('SELECT * { GRAPH <sct:900000000000207008> { ?s ?p ?o } }'))
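For the original S3 scenario, the same pattern should work on bytes fetched with boto3; a hedged sketch, where the bucket name, key, and connection details are placeholders:
import boto3
import stardog

s3 = boto3.client('s3')
# 'my-bucket' and 'ontologies/snomed.ttl' are placeholder names for this sketch
obj = s3.get_object(Bucket='my-bucket', Key='ontologies/snomed.ttl')
ttl_bytes = obj['Body'].read()  # bytes content of the Turtle file

conn = stardog.Connection('myDb', endpoint='http://localhost:5820', username='admin', password='admin')
conn.begin()
conn.add(stardog.content.Raw(ttl_bytes, content_type='text/turtle'), graph_uri='sct:900000000000207008')
conn.commit()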

How do I update column description in BigQuery table using python script?

I can use SchemaField(f"{field_name}", f"{field_type}", mode="NULLABLE", description=...) while creating a new table, but I want to update the description of a column in a table that has already been uploaded.
Unfortunately, there is no mechanism available yet to update a column description of a table through the client library. As a workaround, you can try one of the following options to update your table's column-level description:
Option 1: Using the following ALTER TABLE ALTER COLUMN SET OPTIONS data definition language (DDL) statement:
ALTER TABLE `projectID.datasetID.tableID`
ALTER COLUMN Name
SET OPTIONS (
description="Country Name"
);
Refer to this doc for more information about the ALTER COLUMN SET OPTIONS statement.
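Since the question asks for a Python script, the same DDL can be submitted through the official client library; a minimal sketch, assuming google-cloud-bigquery is installed and projectID.datasetID.tableID is replaced with your own table:
from google.cloud import bigquery

client = bigquery.Client()

# Run the ALTER COLUMN SET OPTIONS statement as an ordinary query job
ddl = """
ALTER TABLE `projectID.datasetID.tableID`
ALTER COLUMN Name
SET OPTIONS (description="Country Name");
"""
client.query(ddl).result()  # waits for the DDL job to finish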
Option 2: Using the bq command-line tool's bq update command:
Step 1: Get the JSON schema by running the following bq show command:
bq show --format=prettyjson projectID:datasetID.tableID > table.json
Step 2: Then copy the schema from the table.json to schema.json file.
Note: Don't copy the entire contents of the table.json file; copy only the schema. It will look something like this:
[
  {
    "description": "Country Name",
    "mode": "NULLABLE",
    "name": "Name",
    "type": "STRING"
  }
]
Step 3: In the schema.json file, modify the description value as you like. Then run the following bq update command to update the table column description.
bq update projectID:datasetID.tableID schema.json
Refer to this doc for more information about bq update command.
Option 3: Calling the tables.patch API method:
Refer to this doc for more information about tables.patch API method.
As per your requirement, I took the following Python code from this Medium article and not from the official Google Cloud docs, so Google Cloud will not provide any support for this code.
Step 1: Add the schema in the ‘schema.py’ file and modify the column description name as per your requirement:
# Add field schema
TableObject = {
    "tableReference": {
        "projectId": "projectID",
        "datasetId": "datasetID",
        "tableId": "tableID",
    },
    "schema": {
        "fields": [
            {
                "description": "Country Name",
                "mode": "NULLABLE",
                "name": "Name",
                "type": "STRING"
            }
        ],
    },
}
Step 2: Run the following code to get the expected result:
Note: keep schema.py and the following code file in the same directory.
#!/usr/bin/env python
# https://developers.google.com/api-client-library/python/
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
from schema import TableObject

# [START Table Creator]
def PatchTable(bigquery):
    tables = bigquery.tables()
    tables.patch(
        projectId=TableObject['tableReference']['projectId'],
        datasetId=TableObject['tableReference']['datasetId'],
        tableId=TableObject['tableReference']['tableId'],
        body=TableObject).execute()
    print("Table Patched")
# [END]

def main():
    # To get credentials
    credentials = GoogleCredentials.get_application_default()
    # Construct the service object for interacting with the BigQuery API.
    bigquery = discovery.build('bigquery', 'v2', credentials=credentials)
    PatchTable(bigquery)

if __name__ == '__main__':
    main()
    print("BigQuery Table Patch")

Selecting literal values from Wikidata federated query service using RDFLib

I'm trying to get external identifiers for an entity in Wikidata. Using the following query, I can get the literal values (_value) and optionally formatted URLs (value) for Q2409 on the Wikidata Query Service site.
Load in Wikidata Query Service
SELECT ?property ?_value ?value
WHERE {
  ?property wikibase:propertyType wikibase:ExternalId .
  ?property wikibase:directClaim ?propertyclaim .
  OPTIONAL { ?property wdt:P1630 ?formatterURL . }
  wd:Q2409 ?propertyclaim ?_value .
  BIND(IF(BOUND(?formatterURL), IRI(REPLACE(?formatterURL, "\\$", ?_value)), ?_value) AS ?value)
}
Using RDFLib, I'm writing the same query, but with a federated service.
from rdflib import Graph
from rdflib.plugins.sparql import prepareQuery

g = Graph()

q = prepareQuery(r"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?property ?_value ?value
WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?property wikibase:propertyType wikibase:ExternalId .
    ?property wikibase:directClaim ?propertyclaim .
    OPTIONAL { ?property wdt:P1630 ?formatterURL . }
    wd:Q2409 ?propertyclaim ?_value .
    BIND(IF(BOUND(?formatterURL), IRI(REPLACE(?formatterURL, "\\$", ?_value)), ?_value) AS ?value)
  }
}
""")

for row in g.query(q, DEBUG=True):
    print(row)
With this, I'm getting the URLs as URIRef objects, but instead of Literal objects for the literal values I'm getting None.
First 6 lines of output:
(rdflib.term.URIRef('http://www.wikidata.org/entity/P232'), None, None)
(rdflib.term.URIRef('http://www.wikidata.org/entity/P657'), None, None)
(rdflib.term.URIRef('http://www.wikidata.org/entity/P6366'), None, None)
(rdflib.term.URIRef('http://www.wikidata.org/entity/P1296'), None, rdflib.term.URIRef('https://www.enciclopedia.cat/EC-GEC-01407541.xml'))
(rdflib.term.URIRef('http://www.wikidata.org/entity/P486'), None, rdflib.term.URIRef('https://id.nlm.nih.gov/mesh/D0068511.html'))
(rdflib.term.URIRef('http://www.wikidata.org/entity/P7033'), None, rdflib.term.URIRef('http://vocabulary.curriculum.edu.au/scot/5001.html'))
What am I missing for the literal values? I'm having trouble figuring out why I'm getting None instead of the values.
I'm not sure if all of the features of SERVICE calls are fully implemented in RDFLib.
I would first get this working with a 'normal' call to the Wikidata SPARQL endpoint, using either RDFLib's SPARQLWrapper library or a general-purpose HTTP library such as requests or httpx. If that all works, you could then try again with the SERVICE request, but you likely won't need it.
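For example, a minimal sketch of querying the endpoint directly with SPARQLWrapper, reusing the query from the question without the SERVICE wrapper:
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?property ?_value ?value
WHERE {
  ?property wikibase:propertyType wikibase:ExternalId .
  ?property wikibase:directClaim ?propertyclaim .
  OPTIONAL { ?property wdt:P1630 ?formatterURL . }
  wd:Q2409 ?propertyclaim ?_value .
  BIND(IF(BOUND(?formatterURL), IRI(REPLACE(?formatterURL, "\\$", ?_value)), ?_value) AS ?value)
}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="example-user-agent/0.1")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for b in results["results"]["bindings"]:
    # each binding carries its value plus a type ('uri' or 'literal'), so literals survive here
    print(b["property"]["value"], b["_value"]["value"], b["value"]["value"])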

How to create date partitioned tables in GBQ? Can you use python?

I have just under 100M records of data that I wish to transform by denormalising a field and then input into a date partitioned GBQ table. The dates go back to 2001.
I had hoped that I could transform it with Python and then use GBQ directly from the script to accomplish this, but after reading up on this, and particularly this document, it doesn't seem straightforward to create date-partitioned tables. I'm looking for a steer in the right direction.
Is there any working example of a python script that can do this? Or is it not possible to do via Python? Or is there another method someone can point me in the direction of?
Update
I'm not sure if I've missed something, but the tables created appear to be partitioned according to the insert date of when I'm creating the table, whereas I want to partition by a date set within the existing dataset. I can't see any way of changing this.
Here's what I've been experimenting with:
import uuid
import os
import csv
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
from google.cloud.bigquery import Client
from google.cloud.bigquery import Table
import logging  # logging.warning(data_store+file)
import json
import pprint

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to service account credentials'

client = bigquery.Client()

dataset = client.dataset('test_dataset')
dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.partitioning_type = "DAY"
table.create()

rows = [
    ('bob', 30),
    ('bill', 31)
]
table.insert_data(rows)
Is it possible to modify this to take control of the partitions as I create tables and insert data?
Update 2
It turns out I wasn't looking for table partitioning, for my use case it's enough to simply append a date serial to the end of my table name and then query with something along the lines of:
SELECT * FROM `dataset.test_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'
I don't know whether this is technically still partitioning or not, but as far as I can see it has the same benefits.
Updated to the latest version (google-cloud-bigquery==1.4.0):
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = bigquery.Table(table_ref, schema=SCHEMA)

partition = 'DAY'  # the partition granularity; only DAY is handled here
if partition not in ('DAY',):
    raise NotImplementedError(f"BigQuery partition type unknown: {partition}")
table.time_partitioning = bigquery.table.TimePartitioning(type_=partition)

table = client.create_table(table)  # API request
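Since the question wants to partition by a date already present in the data rather than by ingestion time, note that TimePartitioning also accepts a field argument naming a DATE or TIMESTAMP column. A minimal sketch reusing the table and client objects from above, where my_date is a placeholder column name that would have to exist in the schema:
# Partition on an existing DATE/TIMESTAMP column instead of ingestion time
table.time_partitioning = bigquery.table.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field='my_date',  # placeholder: must name a DATE or TIMESTAMP column in the schema
)
table = client.create_table(table)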
You can easily create date partitioned tables using the API and Python SDK. Simply set the timePartitioning field to DAY in your script:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/a14905b6931ba3be94adac4d12d59232077b33d2/bigquery/google/cloud/bigquery/table.py#L219
Or roll your own table insert request with the following body:
{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table1",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}
Everything is just backed by the REST API here.
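If you do roll your own request, a minimal sketch of submitting that body through the googleapiclient discovery interface already used elsewhere in this thread (project, dataset, and table names are placeholders):
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bq = discovery.build('bigquery', 'v2', credentials=credentials)

table_body = {
    "tableReference": {
        "projectId": "myProject",
        "datasetId": "mydataset",
        "tableId": "table1"
    },
    "timePartitioning": {"type": "DAY"}
}

# tables().insert creates the table with day-based time partitioning
bq.tables().insert(projectId="myProject", datasetId="mydataset", body=table_body).execute()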
Be aware that different versions of google-api-core handle time-partitioned tables differently. For example, using google-cloud-core==0.29.1, you must use the bigquery.Table object to create time-partitioned tables:
from google.cloud import bigquery
MY_SA_PATH = "/path/to/my/service-account-file.json"
MY_DATASET_NAME = "example"
MY_TABLE_NAME = "my_table"
client = bigquery.Client.from_service_account_json(MY_SA_PATH)
dataset_ref = client.dataset(MY_DATASET_NAME)
table_ref = dataset_ref.table(MY_TABLE_NAME)
actual_table = bigquery.Table(table_ref)
actual_table.partitioning_type = "DAY"
client.create_table(actual_table)
I only discovered this by looking at the 0.20.1 Table source code. I didn't see this in any docs or examples. If you're having problems creating time-partitioned tables, I suggest that you identify the version of each Google library that you're using (for example, using pip freeze), and check your work against the library's source code.

How to upload a local CSV to google big query using python

I'm trying to upload a local CSV to google big query using python
def uploadCsvToGbq(self, table_name):
    load_config = {
        'destinationTable': {
            'projectId': self.project_id,
            'datasetId': self.dataset_id,
            'tableId': table_name
        }
    }
    load_config['schema'] = {
        'fields': [
            {'name': 'full_name', 'type': 'STRING'},
            {'name': 'age', 'type': 'INTEGER'},
        ]
    }
    load_config['sourceFormat'] = 'CSV'

    upload = MediaFileUpload('sample.csv',
                             mimetype='application/octet-stream',
                             # This enables resumable uploads.
                             resumable=True)
    start = time.time()
    job_id = 'job_%d' % start
    # Create the job.
    result = bigquery.jobs.insert(
        projectId=self.project_id,
        body={
            'jobReference': {
                'jobId': job_id
            },
            'configuration': {
                'load': load_config
            }
        },
        media_body=upload).execute()
    return result
When I run this, it throws an error like:
"NameError: global name 'MediaFileUpload' is not defined"
Is a module import needed? Please help.
One of the easiest methods to upload a CSV file to GBQ is through pandas. Just import the CSV file into pandas (pd.read_csv()), then send it from pandas to GBQ (df.to_gbq(full_table_id, project_id=project_id)).
import pandas as pd
import csv
df=pd.read_csv('/..localpath/filename.csv')
df.to_gbq(full_table_id, project_id=project_id)
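If the destination table already exists, to_gbq's if_exists parameter controls whether the call fails, replaces, or appends; for example:
# Append to an existing table instead of raising an error
df.to_gbq(full_table_id, project_id=project_id, if_exists='append')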
Or you can use the client API:
from google.cloud import bigquery
import pandas as pd
df=pd.read_csv('/..localpath/filename.csv')
client = bigquery.Client()
dataset_ref = client.dataset('my_dataset')
table_ref = dataset_ref.table('new_table')
client.load_table_from_dataframe(df, table_ref).result()
To fix the NameError, first make sure the Google API client library is installed:
pip install --upgrade google-api-python-client
Then at the top of your Python file write:
from googleapiclient.http import MediaFileUpload
But take care: you are missing some parentheses. Better write:
result = bigquery.jobs().insert(projectId=PROJECT_ID, body={'jobReference': {'jobId': job_id},'configuration': {'load': load_config}}, media_body=upload).execute(num_retries=5)
And by the way, you are going to upload all your CSV rows, including the top one that defines columns.
The class MediaFileUpload is in http.py. See https://google-api-python-client.googlecode.com/hg/docs/epy/apiclient.http.MediaFileUpload-class.html
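With current versions of google-cloud-bigquery, the same upload can also be done without the low-level discovery client; a hedged sketch, assuming a sample.csv with the same two columns and a header row, and a placeholder table ID:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # placeholder table reference

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row that defines the columns
    schema=[
        bigquery.SchemaField("full_name", "STRING"),
        bigquery.SchemaField("age", "INTEGER"),
    ],
)

with open("sample.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete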
