Write a variable's data as-is to an ADLS file - python

I want to write the content of a variable that is created dynamically in the program to an ADLS file.
This is how I am getting the data -
from dataclasses import dataclass, asdict

@dataclass
class pipeline_run:
    id: str
    group_id: str
    run_start: str
    run_end: str
    pipeline_name: str
    pipeline_status: str
    parameters: str
    message: str
    addl_properties: str

runs = adf_client.pipeline_runs.query_by_factory(rg_name, df_name,
    filter_parameters={'lastUpdatedBefore': 'date'})

# List to hold the pipeline run returns
list_of_json_data = list()
for i in range(0, len(runs.value)):
    this_run = runs.value[i]
    # Gathering each run's information and storing it in the dataclass
    data = pipeline_run(this_run.run_id, this_run.run_group_id, this_run.run_start, this_run.run_end,
                        this_run.pipeline_name, this_run.status, this_run.parameters, this_run.message,
                        this_run.additional_properties)
    # Converting the dataclass to a dict and appending it to the list
    list_of_json_data.append(asdict(data))
Now, I want to write list_of_json_data to an ADLS file (.json). Any help would be appreciated. Thanks!

As you already have the list of dictionaries in a variable, follow the approach below to achieve your requirement.
First, create an ADLS Gen2 linked service in Synapse.
Then mount your target container using the linked service.
mssparkutils.fs.mount(
    "abfss://<container_name>@<Storage_account_name>.dfs.core.windows.net",
    "/<Mountpoint_name>",
    {"linkedService": "<Linked_service_name>"}
)
After mounting, you can write the file either with a PySpark dataframe and pandas, or with open() and json in plain Python. Use the mount point when building the file path.
Using a PySpark dataframe and pandas:
Here I have used a sample list of dictionaries as the variable list_of_json_data.
Code:
list_of_json_data = [{"id": "24", "group_id": "1224", "run_id": "990b0720-4747-4992-b87f-a74e1078a5f1"},
                     {"id": "16", "group_id": "1216", "run_id": "990b0720-4747-4992-b87f-a74e1078a5f1"},
                     {"id": "20", "group_id": "2408", "run_id": "990b0720-4747-4992-b87f-a74e1078a5f1"}]

# Create a Spark dataframe from the variable
df = spark.createDataFrame(list_of_json_data)
display(df)

# Get the Spark job id to build the path
jobid = mssparkutils.env.getJobId()

# Build the path through the mount point
LogFilepath = '/synfs/' + jobid + '/sourcedata/Sample2.json'
print(LogFilepath)

# Write to the JSON path
df.toPandas().to_json(LogFilepath, orient='records')
Result JSON file:
Using with open() and json:
Use the code below:
import json

# Get the Spark job id to build the file path
jobid = mssparkutils.env.getJobId()
Filepath = '/synfs/' + jobid + '/sourcedata/PipelinesJSON.json'

# Write to the JSON path
with open(Filepath, 'w') as f:
    json.dump(list_of_json_data, f)
Result JSON file:

Related

Merging multiple JSON files into single JSON file in S3 from AWS Lambda python function

I am stuck in my work, where my requirement is to combine multiple JSON files into a single JSON file and compress it in an S3 folder.
Somehow I did it, but the JSON contents end up merged as dictionary keys. I used a dictionary to load my JSON content from the files because when I tried loading it as a list it threw a JSONDecodeError: "Extra data: line 1 column 432 (431)".
My files look like below:
file1 (there is no .json extension)
{"abc":"bcd","12354":"31354321"}
file 2
{"abc":"bcd","12354":"31354321":"hqeddeqf":"5765354"}
My code:
import json
import boto3

s3_client = boto3.client('s3')
bucket_name = '<my bucket>'

def lambda_handler(event, context):
    key = '<Bucket key>'
    jsonfilesname = ['<name of the json files which stored in list>']
    result = []
    json_data = {}
    for f in (range(len(jsonfilesname))):
        s3_client.download_file(bucket_name, key+jsonfilesname[f], '/tmp/'+key+jsonfilesname[f])
        infile = open('/tmp/'+jsonfilesname[f]).read()
        json_data[infile] = result
    with open('/tmp/merged_file', 'w') as outfile:
        json.dump(json_data, outfile)
My output in the outfile from the above code is:
{
"{"abc":"bcd","12354":"31354321"}: []",
"{"abc":"bcd","12354":"31354321":"hqeddeqf":"5765354"} :[]"
}
my expectation is:
{"abc":"bcd","12354":"31354321"},{"abc":"bcd","12354":"31354321":"hqeddeqf":"5765354"}
Could someone please advise what needs to be done to get my expected output?
First of all, file 2 is not a valid JSON file; correctly it should be:
{
    "abc": "bcd",
    "12354": "31354321",
    "hqeddeqf": "5765354"
}
Also, the expected output is not valid JSON; what you would expect after merging two JSON files is an array of JSON objects:
[
    {
        "abc": "bcd",
        "12354": "31354321"
    },
    {
        "abc": "bcd",
        "12354": "31354321",
        "hqeddeqf": "5765354"
    }
]
Knowing this, we could write a Lambda to merge the JSON files:
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = '...'
    jsonfilesname = ['file1.json', 'file2.json']
    result = []
    for key in jsonfilesname:
        data = s3.get_object(Bucket=bucket, Key=key)
        content = json.loads(data['Body'].read().decode("utf-8"))
        result.append(content)
    # Do something with the merged content
    print(json.dumps(result))
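Since the requirement is also to store the merged result back in S3, a minimal sketch of that last step could replace the print call inside the handler (the output key name merged.json is an assumption, not from the question):
    # Upload the merged array as a single JSON document (hypothetical output key)
    s3.put_object(Bucket=bucket,
                  Key='merged.json',
                  Body=json.dumps(result).encode('utf-8'),
                  ContentType='application/json')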
If you are using AWS, I would recommend using S3DistCp for JSON file merging, as it provides a fault-tolerant, distributed way that can keep up with large files as well by leveraging MapReduce. However, it does not seem to support in-place merging.

How to read big files from a Minio bucket, split them into multiple files based on timestamp, and store them back using the Dask framework in Python

I have a big text file (with millions of records) in bz2 format in a Minio bucket.
Currently I am processing it with the procedure below:
Read the file from the Minio bucket;
Partition the file per day based on the 'timestamp' column;
Remove some of the empty/blank partitions using 'cull_empty_partitions()';
Save the partitioned files to a local directory as .csv;
Save them back to the Minio bucket;
Remove the files from the local workspace.
In the current procedure, I have to store the files in the local workspace, which I don't want.
All I want is to read the .txt or .bz2 files from my bucket without using the local workspace at all.
Then, name each partition based on the first date in its 'timestamp' column in the Dask dataframe and store the partitions back directly into the Minio bucket using the Dask framework.
Here is my code:
import os
import dask.dataframe as dd
from datetime import date, timedelta

path = '/a/desktop/workspace/project/log/'
bucket = config['data_bucket']['abc']
folder_prefix = config["folder_prefix"]["root"]
folder_store = config["folder_prefix"]["store"]

col_names = [
    "id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"
]

data = dd.read_csv(
    folder_prefix + 'abc1/2001-01-01-logging_*.txt',
    sep='\t', names=col_names, parse_dates=['timestamp'],
    low_memory=False
)
data['timestamp'] = dd.to_datetime(
    data['timestamp'], format='%Y-%m-%d %H:%M:%S',
    errors='ignore'
)
ddf = data.set_index(data['timestamp']).repartition(freq='1d').dropna()

# Remove the dask dataframe partitions which are split as empty
ddf = cull_empty_partitions(ddf)

# Store the partitioned dask files in the local workspace as csv
o = ddf.to_csv("out_csv/log_user_*.csv", index=False)

# Store the files in the minio bucket
for each in o:
    if len(each) > 0:
        print(each.split("/")[-1])
        minioClient.fput_object(bucket, folder_store + each.split("/")[-1], each)
        # Remove the partitioned csv files from the local workspace
        os.remove(each)
I can use the code below to connect to the S3 buckets and list them:
import boto3
import botocore, os
from botocore.client import Config
from botocore.session import Session

s3 = boto3.resource('s3',
                    endpoint_url='https://blabalbla.com',
                    aws_access_key_id="abcd",
                    aws_secret_access_key="sfsdfdfdcdfdfedfsdfsdf",
                    config=Config(signature_version='s3v4'),
                    region_name='us-east-1')

os.environ['S3_USE_SIGV4'] = 'True'

for bucket in s3.buckets.all():
    print(bucket.name)
When I try to read the objects in the bucket with the code below, it does not respond.
df = dd.read_csv('s3://bucket/myfiles.*.csv')
Any update on this regard will be highly appreciated. Thank you in advance!
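A minimal sketch of one possible approach (not from the thread): pass the MinIO endpoint and credentials to s3fs via storage_options, so Dask can read the bz2 files and write the partitioned CSVs directly against the bucket. The bucket paths below are placeholders; the endpoint and credentials are the ones from the boto3 snippet above.
import dask.dataframe as dd

# Placeholder storage options -- endpoint and credentials copied from the boto3 snippet above
minio_storage_options = {
    "key": "abcd",
    "secret": "sfsdfdfdcdfdfedfsdfsdf",
    "client_kwargs": {"endpoint_url": "https://blabalbla.com"},
}

col_names = ["id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"]

# Read the compressed files straight from the bucket; bz2 is not splittable,
# so blocksize=None keeps each file as a single partition
data = dd.read_csv(
    "s3://<bucket>/abc1/2001-01-01-logging_*.txt.bz2",
    sep="\t", names=col_names, parse_dates=["timestamp"],
    compression="bz2", blocksize=None,
    storage_options=minio_storage_options,
)

# Partition per day on the timestamp index, as in the original code
ddf = data.set_index("timestamp").repartition(freq="1d").dropna()

# Write the daily partitions back to the bucket without touching the local workspace
ddf.to_csv(
    "s3://<bucket>/store/log_user_*.csv",
    index=False,
    storage_options=minio_storage_options,
)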

Writing pandas DataFrame to Azure Blob Storage from Azure Function

I am writing a simple Azure Function to read an input blob, create a pandas DataFrame from it and then write it to Blob Storage again as a CSV. I have the code given below to read the file and convert it into a DataFrame:
import logging
import io
import pandas as pd
import azure.functions as func

def main(inputBlob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {inputBlob.name}\n"
                 f"Blob Size: {inputBlob.length} bytes")
    df = pd.read_csv(io.BytesIO(inputBlob.read()), sep='#', encoding='unicode_escape',
                     header=None, names=range(16))
    logging.info(df.head())
How can I write this DataFrame out to Blob Storage?
I have uploaded the file with the code below; target is the container and target.csv is the blob we want to write to.
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(CONN_STR)

# WRITE HEADER TO AN OUTPUT FILE
output_file_dest = blob_service_client.get_blob_client(container="target", blob="target.csv")

# INITIALIZE OUTPUT
output_str = ""

# STORE COLUMN HEADERS
data = list()
data.append(list(["column1", "column2", "column3", "column4"]))

# Adding data to a variable. Here you can pass the input blob instead.
# Also check the upload_blob parameters that fit your requirement.
output_str += ('"' + '","'.join(data[0]) + '"\n')
output_file_dest.upload_blob(output_str, overwrite=True)
In the code above you can ignore the # STORE COLUMN HEADERS part and replace it with the data you read from the input blob using pandas.
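For reference, a minimal sketch of that last step with the DataFrame from the question (CONN_STR and the target container/blob names are taken from the snippets above; this is an assumption about how the pieces fit together, not the exact implementation):
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(CONN_STR)
output_file_dest = blob_service_client.get_blob_client(container="target", blob="target.csv")

# Serialize the DataFrame built in main() to CSV in memory and upload it as the blob content
output_file_dest.upload_blob(df.to_csv(index=False), overwrite=True)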

python script to get the instance ID from csv files and start the instance

I have CSV files which have multiple EC2 instance IDs.
Is there any way I can run a Python script to fetch all the instance IDs and start all of them?
It is a rather straightforward task.
Read all rows from CSV.
Submit call to EC2 using the extracted info.
import csv
import boto3

with open('instances.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    instances = [row["Instance Id"] for row in reader]

client = boto3.client('ec2')
response = client.start_instances(InstanceIds=instances)
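Note that this assumes instances.csv has a header row with a column named exactly Instance Id, for example (the IDs below are placeholders):
Instance Id
i-0123456789abcdef0
i-0fedcba9876543210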

Python CSV writer missing row of data

I'm experiencing a problem when I dump JSON data into a CSV file. There is typically a block of JSON data that is missing from the CSV file, but can be seen if I print the JSON in the console or to a file.
Essentially I am calling a service twice and receiving back two json responses that I parse and dump into a CSV file. The service can only be called for 7 day increments (unix time), so I have implemented logic to call the service for this increment over a period of time.
I'm using the python vanilla json and csv libraries.
First the CSV is created with headers:
with open('history_' + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + '.csv', 'wb') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerow(["Column1", "Column2", "Column3", "Column4", "Column5",
                     "Column6"])
Then, I have a counter that calls the service twice, fifty times (following the open of the CSV file):
while y < 50:
    jsonResponseOne = getJsonOne(7)
    jsonResponseTwo = getJsonTwo(7)
Example json response:
{"Value":
[
{"ExampleName": "Test",
"ExampleNameTwo": "Test2",
"ExampleDate": "1436103790",
"ExampleCode": 00000001,
"ExampleofExample": "abcd",
"AnotherExample": "hello"},
{"ExampleName": "Test2",
"ExampleNameTwo": "Test3",
"ExampleDate": "1436103790",
"ExampleCode": 00000011,
"ExampleofExample": "abcd",
"AnotherExample": "hello2"},
]
}
The CSV output columns would look like:
ExampleName ExampleNameTwo ExampleDate ExampleCode ExampleofExample AnotherExample
Finally, the CSV is written as follows:
for item in jsonResponseOne['Value']:
    row = []
    row.append(str(item['ExampleName'].encode('utf-8')))
    if item.get("ExampleNameTwo"):
        row.append(str(item["ExampleNameTwo"]))
    else:
        row.append("None")
    row.append(str(item['ExampleDate']))
    row.append(str(item['ExampleCode'].encode('utf-8')))
    row.append(str(item['ExampleofExample'].encode('utf-8')))
    row.append(str(item['AnotherExample'].encode('utf-8')))
    writer.writerow(row)

for item in jsonResponseTwo['Value']:
    anotherRow = []
    anotherRow.append(str(item['ExampleName'].encode('utf-8')))
    if item.get("ExampleNameTwo"):
        anotherRow.append(str(item["ExampleNameTwo"]))
    else:
        anotherRow.append("None")
    anotherRow.append(str(item['ExampleDate']))
    anotherRow.append(str(item['ExampleCode'].encode('utf-8')))
    anotherRow.append(str(item['ExampleofExample'].encode('utf-8')))
    anotherRow.append(str(item['AnotherExample'].encode('utf-8')))
    writer.writerow(anotherRow)
Why could my CSV output be missing an entire row of data (a block of data from the JSON response)?
Resolved.
The Python script had an indentation issue in one of the while loops, causing some data to be skipped over and not written to the CSV file.
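In other words, a minimal structural sketch of that kind of fix (the original loop is not shown above, and build_row is a hypothetical helper standing in for the row-building code from the question): the write loops have to sit inside the while loop so every iteration's responses are written, not just the last ones.
y = 0
while y < 50:
    jsonResponseOne = getJsonOne(7)
    jsonResponseTwo = getJsonTwo(7)
    # Both write loops indented inside the while loop,
    # so each 7-day increment is written before the next service call
    for item in jsonResponseOne['Value']:
        writer.writerow(build_row(item))
    for item in jsonResponseTwo['Value']:
        writer.writerow(build_row(item))
    y += 1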
