I have a task to transform and consolidate millions of single JSON files into big CSV files.
The operation would be very simple using a Copy activity and mapping the schemas, which I have already tested. The problem is that a massive number of the files contain badly formatted JSON.
I know what the error is, and the fix is very simple too. I figured I could use a Python Databricks activity to fix the string and then pass the output to a Copy activity that consolidates the records into a big CSV file.
I have something like this in mind, but I'm not sure it is the proper way to address the task. I don't know how to use the output of the Copy activity in the Databricks activity.
It sounds like you want to transform a large number of single JSON files using Azure Data Factory, but that is not supported on Azure at the moment, as @KamilNowinski said. However, since you are already using Azure Databricks, writing a simple Python script to do the same thing is easier. A workaround is to use the Azure Storage SDK and the pandas Python package directly, in a few steps on Azure Databricks.
Assuming these JSON files are all in one container of Azure Blob Storage, you need to list them via list_blob_names and generate their URLs with a SAS token for the pandas read_json function, as in the code below (written against the legacy azure-storage-blob 2.x SDK, which provides BaseBlobService).
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta
account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'
service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)
blob_names = service.list_blob_names(container_name)
blob_urls_with_token = (f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}" for blob_name in blob_names)
#print(list(blob_urls_with_token))
Then you can read these JSON files directly from the blobs via the read_json function to create pandas DataFrames.
import pandas as pd
for blob_url_with_token in blob_urls_with_token:
    # each blob is parsed into its own DataFrame
    df = pd.read_json(blob_url_with_token)
If you want to merge them into one big CSV file, you can first merge them into a big DataFrame via the pandas functions listed under Combining / joining / merging, such as concat (DataFrame.append was deprecated and then removed in pandas 2.0).
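With millions of small files, growing a single DataFrame can become expensive, so another option is to repair each JSON string and stream it straight into the CSV. The following is only a minimal sketch using the standard library. The real defect is not stated in the question, so a trailing comma stands in for it here; repair_json, consolidate and the id/name fields are hypothetical names chosen for illustration.

```python
import csv
import io
import json
import re

# ASSUMPTION: the actual JSON defect is not described in the thread; a trailing
# comma before a closing brace/bracket is used here purely as a stand-in.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def repair_json(text):
    """Hypothetical fixer: drop trailing commas (naive, ignores string contents)."""
    return _TRAILING_COMMA.sub(r"\1", text)


def consolidate(json_texts, out_file, fieldnames):
    """Parse each JSON document (one object per file) and write it as one CSV row."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    rows = 0
    for text in json_texts:
        record = json.loads(repair_json(text))
        writer.writerow(record)
        rows += 1
    return rows


# Two in-memory documents stand in for the blob contents.
raw_docs = ['{"id": 1, "name": "a",}', '{"id": 2, "name": "b"}']
buffer = io.StringIO()
count = consolidate(raw_docs, buffer, fieldnames=["id", "name"])
csv_text = buffer.getvalue()
```

Because rows are written one at a time, memory use stays flat regardless of how many files are processed; the in-memory buffer can be swapped for a real file opened with newline="".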
Writing a DataFrame to a CSV file is easy with the to_csv function. Alternatively, you can convert a pandas DataFrame to a PySpark DataFrame on Azure Databricks, as in the code below.
from pyspark.sql import SQLContext
from pyspark import SparkContext
# Databricks already runs a SparkContext, so reuse it instead of creating a second one
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
Whatever you want to do next is then simple. And if you want to schedule the script as a notebook in Azure Databricks, you can refer to the official documentation on Jobs to run Spark jobs.
Hope it helps.
Copy the JSON files to storage (e.g. Blob Storage) so that you can access them from Databricks. Then you can fix the files using Python and even transform them to the required format while the cluster is running.
So, in the Copy Data activity, copy the files to Blob Storage if they are not there yet.
Related
I have created a local Spark environment in Docker. I intend to use this as part of a CI/CD pipeline for unit testing code executed in the Spark environment. I have two scripts which I want to use: one will create a set of persistent Spark databases and tables, and the other will read those tables. Even though the tables should be persistent, they only persist in that specific Spark session. If I create a new Spark session, I cannot access the tables, even though they are visible in the file system. Code examples are below:
Create db and table
Create_script.py
from pyspark.sql import SparkSession
def main():
    spark = SparkSession.builder.appName('Example').getOrCreate()
    columns = ["language", "users_count"]
    data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
    rdd = spark.sparkContext.parallelize(data)
    df = rdd.toDF(columns)
    spark.sql("create database if not exists schema1")
    df.write.mode("ignore").saveAsTable('schema1.table1')
Load Data
load_data.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.sql("select * from schema1.table1")
I know there is a problem because when I run print(spark.catalog.listDatabases()), it only finds the default database. But if I import Create_script.py, it finds the schema1 database.
How do I make persistent tables across all spark sessions?
The files in /repo/test/spark-warehouse contain only the table data, without the metadata for the database, tables, or columns.
If you don't enable Hive, Spark uses an InMemoryCatalog, which is ephemeral, intended only for testing, and available only within the same Spark context. This InMemoryCatalog doesn't provide any function to load databases or tables from the file system.
So there are two ways:
Columnar format
Write the data in ORC or Parquet format in your Create_script.py script, e.g. df.write.orc(path). These formats store the column information alongside the data.
Read it back with spark.read.orc(path), then call createOrReplaceTempView if you need to use it in SQL.
Use embedded Hive
You don't need to install Hive; Spark can work with an embedded Hive in just two steps.
Add the spark-hive dependency. (I'm using Java, which uses pom.xml to manage dependencies; I don't know how to do it in PySpark.)
Call SparkSession.builder().enableHiveSupport().
The data will then be in /repo/test/spark-warehouse/schema1.db and the metadata in /repo/test/metastore_db, which contains the files of a Derby database. You can read and write the tables across all Spark sessions.
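For the PySpark scripts in the question, the second option might look roughly like the sketch below. It assumes the PySpark distribution in use already ships the Hive classes (I have not verified this for every build), and build_session is a hypothetical helper name. The import is done inside the function so the snippet can be loaded without starting Spark.

```python
def build_session(app_name="Example", warehouse_dir="spark-warehouse"):
    """Return a SparkSession backed by the embedded Hive metastore.

    ASSUMPTION: the installed PySpark build includes Hive support.
    """
    # Imported lazily so that defining this helper does not require a JVM.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.sql.warehouse.dir", warehouse_dir)
        .enableHiveSupport()
        .getOrCreate()
    )

# Intended use in both Create_script.py and load_data.py:
#   spark = build_session()
#   spark.sql("select * from schema1.table1").show()
```

By default the Derby metastore_db directory is created in the current working directory, so both scripts would need to be started from the same directory (here /repo/test) to see the same catalog.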
I have a table in DynamoDB with about 50,000 records. I want to pull this whole table to my local computer using the AWS CLI, but I know there is a 1 MB limit per scan request.
What I've done so far
Get the data in JSON format using aws dynamodb scan --table-name my_table_name > output.json
Then, convert the JSON output to a DataFrame using the Python code below.
import json
import pandas as pd
with open("output.json", encoding='utf-8-sig', errors='ignore') as json_data:
    data = json.load(json_data, strict=False)
df = pd.DataFrame(data["Items"])
After that come the necessary preprocessing steps...
Is there a way I can pull the whole table?
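One way to get past the per-request limit from Python is to keep scanning while the response carries a LastEvaluatedKey, passing it back as ExclusiveStartKey. The sketch below is a minimal illustration, not a definitive implementation: scan_all is a hypothetical helper, and FakeTable is a stand-in for boto3's Table resource so that the snippet runs without AWS credentials.

```python
def scan_all(table):
    """Collect every item by following LastEvaluatedKey across scan pages."""
    items = []
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break  # no more pages
        kwargs["ExclusiveStartKey"] = last_key
    return items


class FakeTable:
    """Offline stand-in for a boto3 Table; returns canned pages in order."""

    def __init__(self, pages):
        self._pages = pages
        self.calls = []

    def scan(self, **kwargs):
        self.calls.append(kwargs)
        return self._pages[len(self.calls) - 1]


pages = [
    {"Items": [{"id": 1}, {"id": 2}], "LastEvaluatedKey": {"id": 2}},
    {"Items": [{"id": 3}]},
]
fake = FakeTable(pages)
all_items = scan_all(fake)
```

With real data, the table argument would be something like boto3.resource("dynamodb").Table("my_table_name"), and the returned list can be passed to pd.DataFrame directly.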
I've been working on a project that uses a fairly simple data pipeline to clean and transform raw CSV files into processed data, using Python 3.8 and Lambda to create various subsets which are sent to their respective S3 buckets. The Lambda function is triggered by uploading a raw CSV file to an intake S3 bucket, which initiates the process.
However, I would also like to send some of that processed data directly to QuickSight for ingestion from that same Lambda function, for visual inspection, and that's where I'm currently stuck.
Below is a portion of the function (leaving out the imports) with just the CSV processing and the upload to S3; this is the portion I would like directly ingested into QuickSight:
def featureengineering(event, context):
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    s3_file_name = event['Records'][0]['s3']['object']['key']
    read_file = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)
    # turning the CSV into a dataframe in AWS Lambda
    s3_data = io.BytesIO(read_file.get('Body').read())
    df = pd.read_csv(s3_data, encoding="ISO-8859-1")
    # replacing erroneous zero values with NaN (missing), which is more accurate in general,
    # and creating a new column with just three stages instead for simplification
    df[['Column_A','Column_B']] = df[['Column_A','Column_B']].replace(0, np.nan)
    # applying the 'newstage' function for feature engineering
    df['NewColumn'] = df.Stage.apply(newstage)
    df1 = df
    df1.to_csv(csv_buffer1)
    s3_resource.Object(bucket1, csv_file_1).put(Body=csv_buffer1.getvalue())  # uploading df1 to S3
At that point df1 is sent to its S3 bucket (which works fine), but I'd like it directly ingested into QuickSight as an automated SPICE refresh as well.
In digging around I did find a similar question with an answer:
import boto3
import time
import sys
client = boto3.client('quicksight')
response = client.create_ingestion(DataSetId='<dataset-id>',IngestionId='<ingestion-id>',AwsAccountId='<aws-account-id>')
but the hang-up I'm having is with the DataSetId or, more generally: how do I turn the pandas DataFrame df1 in the Lambda function into something the CreateIngestion API can accept and automatically send to QuickSight as an automated SPICE refresh of the most recent processed data?
You should first create a QuickSight dataset. Quoting from the docs:
A dataset identifies the specific data in a data source that you want to use. For example, the data source might be a table if you are connecting to a database data source. It might be a file if you are connecting to an Amazon S3 data source.
Once you have saved your DataFrame on S3 (either as a CSV or Parquet file), you can create a QuickSight dataset that sources data from it.
You can do so either via the console or programmatically (probably what you're looking for).
Finally, once you have the dataset ID, you can reference it in other QuickSight API calls.
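Put differently, the DataFrame itself is never handed to CreateIngestion; the Lambda only asks QuickSight to re-read the S3-backed dataset after the upload. A minimal sketch of that last step follows, assuming the dataset already exists. refresh_spice is a hypothetical helper, and FakeQuickSight stands in for boto3.client('quicksight') so the snippet runs offline.

```python
import uuid


def refresh_spice(client, account_id, dataset_id):
    """Start a SPICE refresh for an existing dataset and return its ingestion ID."""
    # Each ingestion needs its own ID; a random UUID keeps repeated runs distinct.
    ingestion_id = f"lambda-{uuid.uuid4()}"
    response = client.create_ingestion(
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
        AwsAccountId=account_id,
    )
    return ingestion_id, response


class FakeQuickSight:
    """Offline stand-in that records the arguments it was called with."""

    def __init__(self):
        self.requests = []

    def create_ingestion(self, **kwargs):
        self.requests.append(kwargs)
        return {"Status": 201, "IngestionId": kwargs["IngestionId"]}


fake_client = FakeQuickSight()
ingestion_id, response = refresh_spice(fake_client, "<aws-account-id>", "<dataset-id>")
```

In the Lambda from the question, this call would go right after the put(...) that uploads df1, with the real client in place of the fake one.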
I am trying to query Google BigQuery using the pandas/Python client interface. I am following the tutorial here: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas. I was able to get it to work, but I want to get the data in the JSON format that can be downloaded directly from the web UI (see screenshot). Is there a way to download the data in the JSON structure pictured instead of converting it to a DataFrame object?
I imagine the command would be somewhere around this part of the code from the tutorial:
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
Just add a .to_json(orient='records') call after converting to a DataFrame:
json_data = bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient).to_json(orient='records')
pandas docs
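If the DataFrame is not needed at all, the rows could also be serialized directly. This is an alternative to the answer above, sketched under the assumption that each row yielded by .result() can be converted with dict(row); rows_to_json is a hypothetical helper, and plain dicts stand in for real rows so the snippet runs without credentials.

```python
import json


def rows_to_json(rows):
    """Serialize an iterable of mapping-like rows as a JSON array of records."""
    # default=str is a fallback for values json cannot encode natively
    # (for example dates or decimals); they are written as strings.
    return json.dumps([dict(row) for row in rows], default=str)


# Plain dicts stand in for the rows of bqclient.query(query_string).result().
sample_rows = [{"name": "a", "n": 1}, {"name": "b", "n": 2}]
json_data = rows_to_json(sample_rows)
```

With the client from the tutorial, the call would be rows_to_json(bqclient.query(query_string).result()).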
Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files will arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if the files are located in the same place. For example, if your data folder is s3://bucket/data/<files>, then you can add new files to it and run the ETL job; the new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need either to run a crawler or to execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
Once the data is loaded into a DynamicFrame or a Spark DataFrame, you can apply filters to keep only the data you need. If you still want to work with file names, you can add the file name as a column using the input_file_name Spark function and then filter on it:
from pyspark.sql.functions import col, input_file_name
# input_file_name must be called; the parentheses allow the chained calls to span lines
df = (
    df.withColumn("filename", input_file_name())
    .where(col("filename") == "your-filename")
)
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue.
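For the "latest three files for a given day" part of the question, the file-name filtering above needs a list of names to match. A minimal sketch of how that list could be computed from the timestamp suffix is shown here; latest_for_day is a hypothetical helper, and the export-...csv naming is an assumption about how the suffix appears in the key.

```python
import re
from datetime import date, datetime

# Matches the YYYY-MM-DD-HH-MM-SS suffix described in the question.
_SUFFIX = re.compile(r"(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})")


def latest_for_day(paths, day, n=3):
    """Return the n most recent paths whose timestamp suffix falls on `day`."""
    stamped = []
    for path in paths:
        match = _SUFFIX.search(path)
        if not match:
            continue  # skip keys without a timestamp suffix
        ts = datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
        if ts.date() == day:
            stamped.append((ts, path))
    stamped.sort(reverse=True)  # newest first
    return [path for _, path in stamped[:n]]


# ASSUMPTION: hypothetical object keys, for illustration only.
paths = [
    "s3://bucket/data/export-2021-03-01-08-00-00.csv",
    "s3://bucket/data/export-2021-03-01-12-30-00.csv",
    "s3://bucket/data/export-2021-03-01-09-15-00.csv",
    "s3://bucket/data/export-2021-03-01-23-59-59.csv",
    "s3://bucket/data/export-2021-02-28-23-00-00.csv",
    "s3://bucket/data/readme.txt",
]
latest = latest_for_day(paths, date(2021, 3, 1))
```

The resulting list could then be used in the filter, for example with col("filename").isin(latest), or the paths could be read directly instead of going through the catalog table.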