Push turtle file in bytes form to stardog database using pystardog - python

def add_graph(file, file_name):
    file.seek(0)
    file_content = file.read()
    if 'snomed' in file_name:
        conn.add(stardog.content.Raw(file_content, content_type='bytes', content_encoding='utf-8'),
                 graph_uri='sct:900000000000207008')
Here I'm facing issues pushing the file, which I downloaded from an S3 bucket and is in bytes form. It throws stardog.Exception 500 when pushing this data to the Stardog database.
I also tried pushing the bytes directly, as shown below, but that didn't help either.
conn.add(content.File(file),
         graph_uri='<http://www.example.com/ontology#concept>')
Can someone help me push the turtle file, which is in bytes form, into the Stardog database using the pystardog Python library?

I believe this is what you are looking for:
import stardog

conn_details = {
    'endpoint': 'http://localhost:5820',
    'username': 'admin',
    'password': 'admin'
}
conn = stardog.Connection('myDb', **conn_details)  # assuming you have this since you already have 'conn', just sending it to a DB named 'myDb'

file = open('snomed.ttl', 'rb')  # just opening a file as a binary object to mimic
file_name = 'snomed.ttl'  # adding this to keep your function as it is

def add_graph(file, file_name):
    file.seek(0)
    file_content = file.read()  # this will be of type bytes
    if 'snomed' in file_name:
        conn.begin()  # added this to begin a connection, but I do not think it is required
        conn.add(stardog.content.Raw(file_content, content_type='text/turtle'), graph_uri='sct:900000000000207008')
        conn.commit()  # added this to commit the added data

add_graph(file, file_name)  # I just ran this directly in the Python file for the example.
Take note of the conn.add line where I used text/turtle as the content-type. I added some more context so it can be a running example.
Here is the sample file as well, snomed.ttl:
<http://api.stardog.com/id=1> a :person ;
    <http://api.stardog.com#first_name> "John" ;
    <http://api.stardog.com#id> "1" ;
    <http://api.stardog.com#dob> "1995-01-05" ;
    <http://api.stardog.com#email> "john.doe@example.com" ;
    <http://api.stardog.com#last_name> "Doe" .
EDIT - Query Test
If it runs successfully and there are no errors in stardog.log you should be able to see results using this query. Note that you have to specify the Named Graph since the data was added there. If you query without specifying, it will show no results.
SELECT * {
  GRAPH <sct:900000000000207008> {
    ?s ?p ?o
  }
}
You can run that query in stardog.studio but if you want it in Python, this will print the JSON result:
print(conn.select('SELECT * { GRAPH <sct:900000000000207008> { ?s ?p ?o } }'))
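If you would rather loop over the individual bindings than print the raw JSON, here is a minimal sketch along the same lines (same conn as above, assuming the default JSON results format returned by select):
# Minimal sketch: walk the standard SPARQL JSON results structure returned by select().
results = conn.select('SELECT * { GRAPH <sct:900000000000207008> { ?s ?p ?o } }')
for binding in results['results']['bindings']:
    print(binding['s']['value'], binding['p']['value'], binding['o']['value'])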

Related

How to give output filename in DataflowStartFlexTemplateOperator

I am using the DataflowStartFlexTemplateOperator DAG operator in Airflow to export my BigQuery external table data to Parquet format with the desired number of output files.
But how do I set the output file names here, or at least a filename prefix?
Below is my code:
export_to_gcs = DataflowStartFlexTemplateOperator(
    task_id=f'export_to_gcs_day_{day_num}',
    project_id=PROJECT_ID,
    body={
        'launchParameter': {
            'containerSpecGcsPath': 'gs://dataflow-templates-us-central1/latest/flex/BigQuery_to_Parquet',
            'jobName': f'tivo-export-to-gcs-{run_date}',
            'environment': {
                'stagingLocation': f'gs://{GCS_BUCKET_NAME}/{STAGING_LOCATION}',
                'numWorkers': '1',
                'maxWorkers': '20',
                'workerRegion': 'us-central1',
                'serviceAccountEmail': DF_SA_NAME,
                'machineType': 'n1-standard-4',
                'ipConfiguration': 'WORKER_IP_PRIVATE',
                'tempLocation': f'gs://{GCS_BUCKET_NAME}/{TEMP_LOCATION}',
                'subnetwork': SUBNETWORK,
                'enableStreamingEngine': False
            },
            'parameters': {
                'tableRef': f'{PROJECT_ID}:{DATASET_NAME}.native_firehose_table',
                'bucket': f'gs://{GCS_BUCKET_NAME}/Tivo/site_activity/{run_date}/',
                'numShards': '25',
            },
        }
    },
    location=REGION_NAME,
    wait_until_finished=True,
    dag=dag
)
"BigQuery_to_Parquet" template does not allow you to add a prefix. To do that, you would actually need to download the template and create a custom version.
This is the part in the source code where the files are written:
/*
 * Step 2: Write records to Google Cloud Storage as one or more Parquet files
 * via {@link ParquetIO}.
 */
.apply(
    "WriteToParquet",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(schema))
        .to(options.getBucket())
        .withNumShards(options.getNumShards())
        .withSuffix(FILE_SUFFIX));
Maybe you can take some inspiration from another template that was actually created to support a prefix, such as the Streaming Data Generator, but keep in mind that its purpose is different, and it is streaming rather than batch.
StreamingDataGeneratorWriteToGcs.java takes the prefix from its pipeline options:
/** Converts the fake messages in bytes to json format and writes to a text file. */
private void writeAsParquet(PCollection<GenericRecord> genericRecords, Schema avroSchema) {
  genericRecords.apply(
      "Write Parquet output",
      FileIO.<GenericRecord>write()
          .via(ParquetIO.sink(avroSchema))
          .to(getPipelineOptions().getOutputDirectory())
          .withPrefix(getPipelineOptions().getOutputFilenamePrefix())
          .withSuffix(getPipelineOptions().getOutputType().getFileExtension())
          .withNumShards(getPipelineOptions().getNumShards()));
}
Another option is just to export the files in a very specific folder where they cannot be mixed with any other process and apply a mass renaming in a subsequent task.
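For example, here is a rough sketch of such a renaming step using the google-cloud-storage client (the bucket, folder, and prefix values are placeholders, and in Airflow this could run from a PythonOperator after the export task):
from google.cloud import storage

def add_prefix_to_exports(bucket_name, folder, prefix):
    """Rename every exported shard under `folder` so its filename carries `prefix`."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # materialize the listing first so renamed blobs are not listed again
    for blob in list(client.list_blobs(bucket_name, prefix=folder)):
        shard_name = blob.name.rsplit("/", 1)[-1]  # e.g. output-00000-of-00025.parquet
        bucket.rename_blob(blob, f"{folder}{prefix}-{shard_name}")  # copy + delete internally

# add_prefix_to_exports(GCS_BUCKET_NAME, f"Tivo/site_activity/{run_date}/", "tivo-export")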

Open multiple JSON files with URLs and download the files contained in each using Python

We will receive up to 10k JSON files in a separate directory that must be parsed and converted to separate .csv files. Then the file at the URL in each must be downloaded to another directory. I was planning on doing this in Automator on the Mac and calling a Python script for downloading the files. I have the portion of the shell script done to convert to CSV but have no idea where to start with python to download the URLs.
Here's what I have so far for Automator:
- Shell = /bin/bash
- Pass input = as arguments
- Code = as follows
#!/bin/bash
/usr/bin/perl -CSDA -w <<'EOF' - "$@" > ~/Desktop/out_"$(date '+%F_%H%M%S')".csv
use strict;
use JSON::Syck;
$JSON::Syck::ImplicitUnicode = 1;

# json node paths to extract
my @paths = ('/upload_date', '/title', '/webpage_url');

for (@ARGV) {
    my $json;
    open(IN, "<", $_) or die "$!";
    {
        local $/;
        $json = <IN>;
    }
    close IN;
    my $data = JSON::Syck::Load($json) or next;
    my @values = map { &json_node_at_path($data, $_) } @paths;
    {
        # output CSV spec
        #   - field separator = SPACE
        #   - record separator = LF
        #   - every field is quoted
        local $, = qq( );
        local $\ = qq(\n);
        print map { s/"/""/og; q(").$_.q("); } @values;
    }
}

sub json_node_at_path ($$) {
    # $ : (reference) json object
    # $ : (string) node path
    #
    # E.g. Given node path = '/abc/0/def', it returns either
    #   $obj->{'abc'}->[0]->{'def'} if $obj->{'abc'} is ARRAY; or
    #   $obj->{'abc'}->{'0'}->{'def'} if $obj->{'abc'} is HASH.
    my ($obj, $path) = @_;
    my $r = $obj;
    for ( map { /(^.+$)/ } split /\//, $path ) {
        if ( /^[0-9]+$/ && ref($r) eq 'ARRAY' ) {
            $r = $r->[$_];
        }
        else {
            $r = $r->{$_};
        }
    }
    return $r;
}
EOF
I'm unfamiliar with Automator, so perhaps someone else can address that, but as far as the Python portion goes, it is fairly simple to download a file from a URL. It would go something like this:
import requests

r = requests.get(url)  # assuming you don't need to do any authentication
with open("my_file_name", "wb") as f:
    f.write(r.content)
Requests is a great library for handling HTTP(S), and since the content attribute of the Response is a byte string, we can open a file for writing bytes (the "wb") and write it directly. This works for executable payloads too, so be sure you know what you are downloading. If you don't already have requests installed, run pip install requests or the Mac equivalent.
If you were inclined to do your whole process in Python, I would suggest you look at the json and csv packages. Both of these are part of the standard library and provide high-level interfaces for exactly what you are doing.
Edit:
Here's an example if you were using the json module on a file like this:
[
    {
        "url": <some url>,
        "name": <the name of the file>
    }
]
Your Python code might look similar to this:
import requests
import json

with open("my_json_file.json", "r") as json_f:
    for item in json.load(json_f):
        r = requests.get(item["url"])
        with open(item["name"], "wb") as f:
            f.write(r.content)
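And if you did want to do the whole thing in Python, here is a rough sketch combining json, csv, and requests. It assumes each JSON file carries the same upload_date, title, and webpage_url fields the Perl script extracts, and the directory names are placeholders, so adjust to your real layout:
import csv
import json
import os

import requests

JSON_DIR = "incoming_json"    # directory holding the JSON files (placeholder name)
DOWNLOAD_DIR = "downloads"    # where the downloaded files should go (placeholder name)

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

with open("out.csv", "w", newline="") as csv_f:
    writer = csv.writer(csv_f)
    for name in os.listdir(JSON_DIR):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(JSON_DIR, name)) as json_f:
            data = json.load(json_f)
        # the same three fields the Perl script pulls out
        writer.writerow([data["upload_date"], data["title"], data["webpage_url"]])
        # download whatever the URL points at, named after its last path segment
        r = requests.get(data["webpage_url"])
        out_name = data["webpage_url"].rstrip("/").rsplit("/", 1)[-1] or name + ".bin"
        with open(os.path.join(DOWNLOAD_DIR, out_name), "wb") as f:
            f.write(r.content)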

SPARQLWrapper : problem in querying an ontology in a local file

I'm working with SPARQLWrapper and I'm following the documentation. Here is my code:
queryString = "SELECT * WHERE { ?s ?p ?o. }"
sparql = SPARQLWrapper("http://example.org/sparql")# I replaced this line with
sparql = SPARQLWrapper("file:///thelocation of my file in my computer")
sparql.setQuery(queryString)
try :
ret = sparql.query()
# ret is a stream with the results in XML, see <http://www.w3.org/TR/rdf-sparql-XMLres/>
except :
deal_with_the_exception()
I'm getting these 2 errors:
1- The system cannot find the path specified
2- NameError: name 'deal_with_the_exception' is not defined
You need a SPARQL endpoint to make it work. Consider setting up Apache Fuseki on your local computer. See https://jena.apache.org/documentation/fuseki2/jena
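Once Fuseki is running with your file loaded into a dataset, the query would look roughly like this (a sketch; the dataset name ds and port 3030 are Fuseki defaults and assumptions here):
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: query a local Fuseki endpoint instead of a file:// path.
# 'ds' is an assumed dataset name; 3030 is Fuseki's default port.
sparql = SPARQLWrapper("http://localhost:3030/ds/sparql")
sparql.setQuery("SELECT * WHERE { ?s ?p ?o . } LIMIT 10")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])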

RavenDB Object properly saved but some empty attributes when querying

I'm currently trying to save some Python objects (websites) via PyRavenDB in a RavenDB database. The problem is that the data is saved properly, but when I test it by querying the results, some of the attributes are returned empty.
The code is simple; I can't find the problem.
The JSON object in the database is the following (verified via the DB web UI).
{
    "htmlCode": "<code>TEST HTML</code>",
    "added": "2017-02-21",
    "uniqueid": "262e4584f3e546afa2c67045a0096b54",
    "url": "www.example.com",
    "myHash": "d41d8cd98f00b204e9800998ecf8427e",
    "lastaccessed": "2017-02-21"
}
When I use this code to query
from pyravendb.store import document_store

store = document_store.documentstore(url="http://somewhere:someport", database="websites")
store.initialize()
with store.open_session() as session:
    query_result = list(session.query().where_equals("www.example.com", url))
    print query_result
    print type(query_result)
    return query_result
It returns this object:
{
    'uniqueid': 'f942e86f965d4709a2d69caca3001f2a',
    'url': '',
    'myHash': 'd41d8cd98f00b204e9800998ecf8427e',
    'htmlCode': '',
    'added': '2017-02-21',
    'lastaccessed': '2017-02-21'
}
As you can see, url and htmlCode are empty. They should be OK, since they are properly stored in the DB.
Thanks.
The problem here is that you don't use where_equals right.
The first argument of where_equals is the field name you want to query on, and the second is the value (def where_equals(self, field_name, value)).
Just change this line:
query_result = list(session.query().where_equals("www.example.com", url))
to this:
query_result = list(session.query().where_equals("url", "www.example.com"))
This will fix your problem.
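In context, with the same store and session setup as in your snippet, the corrected query would look roughly like this (just a sketch of the fix applied to your code):
with store.open_session() as session:
    # field name first, then the value to look for
    query_result = list(session.query().where_equals("url", "www.example.com"))
    print query_result  # 'url' and 'htmlCode' should now come back populated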

requests.post isn't passing my string into the contents of my file

I am using Node.js restify server code that accepts a text file upload, and Python client code that uploads the text file.
Here is the relevant Node.js server code:
server.post('/api/uploadfile/:devicename/:filename', uploadFile);
// http://127.0.0.1:7777/api/uploadfile/devname/filename

function uploadFile(req, res, next) {
    var path = req.params.devicename;
    var filename = req.params.filename;
    console.log("Upload file");
    var writeStream = fs.createWriteStream(path + "/" + filename);
    var r = req.pipe(writeStream);
    res.writeHead(200, {"Content-type": "text/plain"});
    r.on("drain", function () {
        res.write(".", "ascii");
    });
    r.on("finish", function () {
        console.log("Upload complete");
        res.write("Upload complete");
        res.end();
    });
    next();
}
This is the Python 2.7 client code:
import requests

file_content = 'This is the text of the file to upload'
r = requests.post('http://127.0.0.1:7777/api/uploadfile/devname/filename.txt',
                  files={'filename.txt': file_content},
                  )
The file filename.txt did appear on the server filesystem. However, the problem is that its contents are empty. If things went right, the text This is the text of the file to upload should appear, but it did not. What is wrong with the code? I am not sure whether it is the server code, the client code, or both that are wrong.
It looks like you're creating a file but never actually getting the uploaded file contents. Check out the bodyParser example at http://restify.com/#bundled-plugins. You need to give the bodyParser a function for handling multi-part data.
Alternatively, you could just use bodyParser without your own handlers and look for the uploaded file information in req.files, including where the temporary uploaded file is for copying to wherever you like.
var restify = require('restify');
var server = restify.createServer();

server.use(restify.bodyParser());

server.post('/upload', function(req, res, next){
    console.log(req.files);
    res.end('upload');
    next();
});

server.listen(9000);
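As a side note, your original uploadFile handler simply pipes the raw request body into a write stream, so another thing worth trying from the Python side (a sketch, not part of the restify answer above) is to send the text as the raw request body instead of multipart form data:
import requests

file_content = 'This is the text of the file to upload'

# Send the text as the raw request body; the original handler writes the body
# straight to disk, so no multipart parsing is involved.
r = requests.post('http://127.0.0.1:7777/api/uploadfile/devname/filename.txt',
                  data=file_content.encode('utf-8'),
                  headers={'Content-Type': 'text/plain'})
print(r.text)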
