Loading CSV into Neo4j on Azure - python

I have Neo4j operational on Azure. I can load data using python and a series of create statements:
create (n:Person) return n
I can query successfully using python.
Using LOAD CSV requires a file in the Neo4j import directory. I've located that directory, but moving a file into it is blocked. I've also tried putting the file in an accessable directory, but then cannot figure out how to address the path in the LOAD CSV statement.
This LOAD gives an error because the file cannot get into the Neo4j import directory:
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///FTDNATree.csv' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})
This statement does not find the file and gives an error: EXTERNAL file not found
USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///{my directory path/}FTDNATree.csv' AS line FIELDTERMINATOR '|' merge (s:SNPNode{SNP:toString(line.Parent)})
Even though the python and neo4j are in the same resource group, they are different VMs. The problem seems to be the interoperability between the two VM?

If you have access to neo4j.conf, then you can modify the value of dbms.directories.import to point to an accessible directory
See https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.directories.import

The solution was NOT well documented in one place. But here is what evolved by trial and error and which works.
I created a storage account within the resource
Created a directory accessible from code in which the upload file was placed.
Added container, called it neo4j-import
I could then a tranfer the file to the container as a blob (i.e., *.csv file)
You then need to make the file accessible. This involves creating an sas token and attaching it to a URL pointing to the container and the file (see python code to do this below).
You can test this URL in your local browser. It should retrieve the file, which is not accessible without the sas token
This URL is used in the LOAD CSV statement and successfully loads the Neo4j database
The code for step 4; pardon indent issues upon pasting here.
from azure.storage.blob import BlobServiceClient, BlobClient,
ContainerClient, generate_account_sas, ResourceTypes, AccountSasPermissions
def UploadFileToDataStorage(FileName,
UploadFileSourceDirecory=ImportDirectory,BlobConnStr=AzureBlobConnectionString,
Container="neo4j-import"):
#uploads file as blob to data storage
#https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python #upload-blobs-to-a-container
blob_service_client = BlobServiceClient.from_connection_string(BlobConnStr)
blob_client = blob_service_client.get_blob_client(container=Container, blob=FileName)
with open(UploadFileSourceDirecory + FileName, "rb") as data:
blob_client.upload_blob(data)
The key python code (step 5 above).
def GetBlobURLwithSAS(FileName,Container="neo4j-import"):
#https://pypi.org/project/azure-storage-blob/
#https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobserviceclient?view=azure-python
#generates sas token for object blob so it can be consumed by another process
sas_token = generate_account_sas(
account_name="{storage account name}",
account_key="{storage acct key}",
resource_types=ResourceTypes(service=False, container=False, object=True),
permission=AccountSasPermissions(read=True),
expiry=datetime.utcnow() + timedelta(hours=1))
return "https://{storage account name}.blob.core.windows.net/" + Container + "/" + FileName + "?" + sas_token
The LOAD statement looks like this and does not use the file:/// prefix:
LOAD CSV WITH HEADERS FROM '" + {URL from above} + "' AS line FIELDTERMINATOR '|'{your cypher query for loading csv}
I hope this helps other to navigate this scenario!

Related

Convert .pdf to .docx on Adobe pdf services API (using Python)

I'm trying to write a Python program converting ".pdf" files to ".docx" ones, using Adobe PDF Server API (free trial).
I've found literature enabling to transform any ".pdf" file to a ".zip" file containing ".txt" files (restoring text data) and ".excel" files (returning tabular data).
import logging
import os.path
from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))
try:
# get base path.
base_path =os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath("C:/..link.../extractpdf/extract_txt_from_pdf.ipynb"))))
# Initial setup, create credentials instance.
credentials = Credentials.service_account_credentials_builder()\
.from_file(base_path + "\\pdfservices-api-credentials.json") \
.build()
#Create an ExecutionContext using credentials and create a new operation instance.
execution_context = ExecutionContext.create(credentials)
extract_pdf_operation = ExtractPDFOperation.create_new()
#Set operation input from a source file.
source = FileRef.create_from_local_file(base_path + "/resources/trs_pdf_file.pdf")
extract_pdf_operation.set_input(source)
# Build ExtractPDF options and set them into the operation
extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
.with_element_to_extract(ExtractElementType.TEXT) \
.with_element_to_extract(ExtractElementType.TABLES) \
.build()
extract_pdf_operation.set_options(extract_pdf_options)
#Execute the operation.
result: FileRef = extract_pdf_operation.execute(execution_context)
# Save the result to the specified location.
result.save_as(base_path + "/output/Extract_TextTableau_From_trs_pdf_file.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
logging.exception("Exception encountered while executing operation")
But I can't yet get the conversion done to a ".docx" file, event after changing the name of the extracted file to name.docx
I went to read the litterature of adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options.ExtractPDFOptions() but didn't found ways to tune the extraction and change it from ".zip" to ".docx". What things can I try next?
Unfortunately, right now the Python SDK is only supporting the Extract portion of our PDF services. You could use the services via the REST APIs (https://documentcloud.adobe.com/document-services/index.html#how-to-get-started-) as an alternative.

Cannot write xlsx to GCS from pandas

I have a strange issue.
I trigger a K8S job from airflow as a data pipeline. At the end I need to write the dataframe to a Google Cloud Storage as a .parquet and .xlsx files.
[...]
export_app.to_parquet(f"{output_path}.parquet")
export_app.to_excel(f"{output_path}.xlsx")
Everything is ok for the parquet file but I got an error for the xlsx.
severity: "INFO"
textPayload: "[Errno 2] No such file or directory: 'gs://my_bucket/incidents/prediction/2020-04-29_incidents_result.xlsx'
I try to write the file as a csv to try
export_app.to_parquet(f"{output_path}.parquet")
export_app.to_csv(f"{output_path}.csv")
export_app.to_excel(f"{output_path}.xlsx")
I get the same message every time and I find the other file as expected.
There is any limitation to write a xlsx file ?
I have the package openpyxl installed in my env.
As requested I am passing some codes how I created new xlsx file using directly gcs python3 api. I used this tutorial and this api reference:
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client()
# Create the bucket object
bucket = storage_client.get_bucket("my-new-bucket")
#Confirm bucket connected
print("Bucket {} connected.".format(bucket.name))
#Create file in the bucket
blob = bucket.blob('test.xlsx')
with open("/home/vitooh/test.xlsx", "rb") as my_file:
blob.upload_from_file(my_file)
I hope it will help!

Load file from Azure Files to Azure Databricks

Looking for a way using Azure files SDK to upload files to my azure databricks blob storage
I tried many things using function from this page
But nothing worked. I don't understand why
example:
file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')
generator = file_service.list_directories_and_files('MYSECRETNAME/test') #listing file in folder /test, working well
for file_or_dir in generator:
print(file_or_dir.name)
file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')
with test.xlsx = name of file in my azure file
/dbfs/FileStore/test6.xlsx => path where to upload the file in my dbfs system
I have the error message:
Exception=The specified resource name contains invalid characters
Tried to change the name but doesn't seem to work
edit: I'm not even sure the function is doing what I want. What is the best way to load file from azure files?
Per my experience, I think the best way to load file from Azure Files is directly to read a file via its url with sas token.
For example, as the figures below, it's a file named test.xlsx in my test file share, that I viewed it using Azure Storage Explorer, then to generate its url with sas token.
Fig 1. Right click the file and then click the Get Shared Access Signature...
Fig 2. Must select the option Read permission for directly reading the file content.
Fig 3. Copy the url with sas token
Here is my sample code, you can run it with the sas token url of your file in your Azure Databricks.
import pandas as pd
url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token )
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
Alternatively, to use Azure File Storage SDK to generate the url with sas token for your file or to get the bytes of your file for reading, please refer to the offical document Develop for Azure Files with Python and my sample code below.
# Create a client of Azure File Service as same as yours
from azure.storage.file import FileService
account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'
file_service = FileService(account_name=account_name, account_key=account_key)
To generate the sas token url of a file
from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)
Or using get_file_to_stream function to read the file content
from io import BytesIO
import pandas as pd
stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)
Just as an addition to #Peter Pan answer, the alternative approach without using Pandas with python azure-storage-file-share library.
Very detailed documentation: https://pypi.org/project/azure-storage-file-share/#downloading-a-file

Unable to read csv file uploaded on google cloud storage bucket

Goal - To read csv file uploaded on google cloud storage bucket.
Environment - Run Jupyter notebook using SSH instance on Master node. Using python on Jupyter notebook trying to access a simple csv file uploaded onto google cloud storage bucket.
Approaches -
1st approach - Write a simple python program
Wrote following program
import csv
f = open('gs://python_test_hm/train.csv' , 'rb' )
csv_f = csv.reader(f)
for row in csv_f
print row
Results - Error message "No such file or directory"
2nd Approach - Using gcloud Package tried to access the train.csv file. The sample code is shown below. Below code is not the actual code. The file on google Cloud storage in my version of code was referred to "gs:///Filename.csv"
Results - Error message "No such file or directory"
Load data from CSV
import csv
from gcloud import bigquery
from gcloud.bigquery import SchemaField
client = bigquery.Client()
dataset = client.dataset('dataset_name')
dataset.create() # API request
SCHEMA = [
SchemaField('full_name', 'STRING', mode='required'),
SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.create()
with open('csv_file', 'rb') as readable:
table.upload_from_file(
readable, source_format='CSV', skip_leading_rows=1)
3rd Approach -
import csv
import urllib
url = 'https://storage.cloud.google.com/<bucket>/train.csv'
response = urllib.urlopen(url)
cr = csv.reader(response)
print cr
for row in cr:
print row
Results - Above code doesn't result in any error but it displays the XML content of the google page as shown below. I am interested in viewing the data of the train csv file.
['<!DOCTYPE html>']
['<html lang="en">']
[' <head>']
[' <meta charset="utf-8">']
[' <meta content="width=300', ' initial-scale=1" name="viewport">']
[' <meta name="google-site-verification" content="LrdTUW9psUAMbh4Ia074- BPEVmcpBxF6Gwf0MSgQXZs">']
[' <title>Sign in - Google Accounts</title>']
Can someone throw some light on what could be possibly wrong here and how do I achieve my goal? Your help is highly appreciated.
Thanks so much for your help!
I assume you are using Jupyter notebook running on a machine in Google Cloud Platform (GCP)?
If that's the case, you will already have the Google Cloud SDK running on that machine (by default).
With this setup you have 2 easy options to work with Google Cloud Storage (GCS):
Use the gcloud/gsutil commands in Jupyter
Writing to GCS: gsutil cp train.csv gs://python_test_hm/train.csv
Reading from GCS:
gsutil cp gs://python_test_hm/train.csv train.csv
Use google-cloud python library
Writing to GCS:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = bucket.blob('train.csv')
blob.upload_from_string('this is test content!')
Reading from GCS:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = storage.Blob('train.csv', bucket)
content = blob.download_as_string()
The sign in page your app fetches isn't actually the object - it's an auth redirect page that, if interacted-with, would proceed to serve the object. You should check out the documentation on Cloud Storage to see about how auth works, and look up the auth details for whichever library or means you use to access the bucket / object.

Creating an on-the-fly zip file from string content for AWS Lambda in Python

I have a Python script that creates a Lambda script in AWS along with all the policies and triggers. I use python boto3 library for that. I create the zip file for the lambda as on-the-fly rather than uploading a static zip file from the hard drive. I use this simple code from below to create my zip file. It creates the zip file without any problems and my python code uploads this zip file as a lambda script and I can view my lambda script in the AWS without any problems. But when I run my lambda script it gives me the module not found error even though I can clearly see that both the module name and the file name does exist and is view-able.
Unable to import module 'xxxx': No module named xxxx
In the file system I double click that zip file that was created by this code and see that the content is created and everything looks normal.
If I bypass zipping on the fly and create the zip statically using WinZip and let the rest of the Python & boto3 script upload this file then it works just fine.
def CreateLambdaZip(self, fileName, fileContent):
with zipfile.ZipFile('Lambda/' + fileName + '.zip', 'w') as myzipc:
myzipc.writestr( fileName + '.py', fileContent)
myzipc.close()
It kinda looks like for the zip file I'm skipping some special headers that is needed by Aws Lambda. Is there such thing? Because in the file system the zip file that is created by Python code and the other one that is created by WinZip are exactly the same. So I know there's nothing wrong with the lambda script.
Update: I'm uploading the zip file using the below code that reads the zip file which was created using the above snippet.
with open('Lambda/'+ fileName +'.zip', 'rb') as zipFile:
func = boto3.client("Lambda").create_function(
FunctionName=lambdaFunction,
Runtime='python2.7',
Role=role['Role']['Arn'],
Handler= fileName + "." + functionName,
Description=description,
Timeout=10,
MemorySize=256,
Publish=True,
Code={'ZipFile': zipFile.read()},
)
When I use zipFile.read() I get 2 different headers for the same content when I zip it using WinZip and when I zip it using Python's module.
Zip file that's created programmatically using Python
b'PK\x03\x04\x14\x00\x00\x00\x00\x00\xe4~\x01IO\x96J=Z\x07\x00\x00Z\x07\x00\x00\x19\x00\x00\x00schedule-ec2-snapshots.pyimport json\nimport boto3\nimport time\nfrom datetime import date, timedelta\n\nprint(\'Loading scheduled EC2 backup actions\')\n\ndef create_snapshots(event, context):\n """\n Lambda function that executes daily snapshots for the instances that
and zipfile created by WinZip
b'PK\x03\x04\x14\x00\x02\x00\x08\x004X\xfcH\x88\x1f\xce\xb5&\x03\x00\x00b\x07\x00\x00\x19\x00\x00\x00schedule-ec2-snapshots.py\x8dU]k\xdb#\x10|7\xf4?,\nA\x12qL\xda\x06B\r~I\x93Bh\x9b\x87&\xf4E\x15\xe1\xac[\xdb\xd7HwBw2\t\xc1\xff\xbd{+\xeb\xcb.\xb4\n\xc4\xba\xdb\xd1\xec\xce\xdc\xae\xa4\x8a\xd2T\x0e~[\xa3\'\xaa\xb9_\x1ag>\xb6\x0b\xa7\n\x9c\xac*S\x80\x14\x0e\xfd\n\xf6\x11\xbf\x9er\\b\xee\xc4dRVJ\xbb(\xfcf\x84Tz\r6\xdb\xa0\xacs\x94p\xfb\xf9\x03,E\xf6\\\x97
With the info above I was able to start the in-memory solution. The deployment of that zip file worked but I could not use the resulting function. Got error:
Unable to import module '<function-name>': No module named <function-name>
I got it to work by specifying the file permissions.
I then use the in-mem-zip to create an AWS lambda function.
Setup:
file_map is a dictionary of full_path->file_bytes.
files is a list of full_paths
def create_lambda_function(function_name, desc, role, handler, file_map, files)
zip_contents = create_in_mem_zip_archive(file_map, files)
result = lambda_code.create_function(
FunctionName=function_name,
Runtime="python2.7",
Description=desc,
Role=role,
Handler=handler,
Code={'ZipFile': zip_contents},
)
return result
def create_in_mem_zip_archive(file_map, files):
buf = io.BytesIO()
logger.info("Building zip file: " + str(files))
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zfh:
for file_name in files:
file_blob = file_map.get(file_name)
if file_blob is None:
logger.error("Missing file {} from files".format(file_name))
continue
try:
info = zipfile.ZipInfo(file_name)
info.date_time = time.localtime()
info.compress_type = zipfile.ZIP_DEFLATED
info.external_attr = 0777 << 16L # give full access
# info.external_attr = 0644 << 16L # -r-wr--r--
# info.external_attr = 0755 << 16L # -rwxr-xr-x
zfh.writestr(info, file_blob)
except Exception as ex:
logger.info("Error reading file: " + file_name + ", error: " + ex.message)
buf.seek(0)
return buf.read()
I have experienced the exactly same problem you have. My solution is do NOT use on the fly zip file. Create a real zip file and add real file into it, and it just works. You can do that even in the lambda environment, by create filepath like "/tmp/yourfile.txt" you can create temp real file when lambda execute.

Categories