I just linked an Azure storage account (ADLS Gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers, each with some subdirectories, and inside those folders are .csv files.
I have connected an Azure service principal with Storage Blob Data Contributor access to the storage account inside Databricks, so I can read and write to the storage account.
I am trying to figure out the best way to convert the existing storage account into a delta lake: tables registered in the metastore, with the files converted to Parquet (Delta tables).
What is the easiest way to do that?
My naive approach as a beginner might be
Read the folder using
spark.read.format("csv").load("{container}@{storage}.../directory")
Write to a new folder with a similar name (so if the folder is directory, write it to directory_parquet) using df.write.format("delta").save("{container}@{storage}.../directory_parquet")
I'm not sure about the last steps. This creates a new folder with a new set of files, and I do get Parquet files, but it isn't a table in Databricks that shows up in the Hive metastore.
Alternatively I can use df.write.format("delta").saveAsTable("tablename"), but that doesn't create the table in the storage account; it lands in the Databricks file system instead, though it does show up in the Hive metastore.
Then delete the existing data files if desired (or leave the data duplicated).
Preferably this can be done in a Databricks notebook using Python, or Scala/SQL if necessary.
*As a fallback, if the effort to do this is monumental, just converting to Parquet and registering table information for each subfolder in the Hive metastore would do, in the form database = containerName, tableName = subdirectoryName.
The folder structure is pretty flat at the moment, so only rootcontainer/Subfolders deep.
Perhaps an external table is what you're looking for:
df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename")
This post has more info on external tables vs managed tables.
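As a rough sketch of the whole flow for one folder (the container, storage account, and folder names below are placeholders, and it assumes the service principal access is already configured):

# Placeholder ABFSS paths -- substitute your own container, storage account and folder names
source_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/directory"
target_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/directory_delta"

# Read the CSV folder (adjust options/schema to match your files)
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(source_path))

# Use the container name as the database and the subfolder name as the table name
spark.sql("CREATE DATABASE IF NOT EXISTS mycontainer")

# Writes the Delta files to the external path AND registers the table in the metastore
(df.write.format("delta")
   .option("path", target_path)
   .saveAsTable("mycontainer.directory"))

Since your structure is flat, you could loop the same pattern over each subfolder (for example via dbutils.fs.ls on the container root) to cover everything.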
Related
Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (it is a column in the BigQuery table), and then create the multiple files.
The table is pretty huge and won't fit in RAM (it could, but that's not the idea); it is 1+ TB. I cannot naively loop over it with Python.
I have read quite a lot of documentation and gone through the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I don't have the skills yet, and I would like to avoid that solution as much as possible for simplicity and maintenance.
You have 2 solutions:
Create one export query per aggregation. If you have 100 nomenclature values, query the table 100 times and export the data into the target directories. The issue is the cost: you will pay for 100 scans of the table.
You can use Apache Beam to extract and sort the data. Then, with a dynamic destination, you will be able to create all the GCS paths that you want. The issue is that it requires Apache Beam skills to achieve.
There is an extra solution, similar to the second one, but using Spark, and especially serverless Spark, to achieve it. If you have more skill in Spark than in Apache Beam, it could be more efficient.
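As a rough illustration of that Spark route, assuming the spark-bigquery-connector is available and that the column is literally named nomenclature (the project, table, and bucket names are placeholders):

# Read the BigQuery table through the spark-bigquery-connector
df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())

# partitionBy writes one sub-folder per distinct value,
# e.g. gs://my-bucket/export/nomenclature=<value>/part-*.csv
(df.write
   .partitionBy("nomenclature")
   .mode("overwrite")
   .csv("gs://my-bucket/export/"))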
We have service account credentials for a Google project. Using those, we need to extract data from a table and load it into a table in a different project.
Approach 1:
Extract the data into Google Cloud Storage --> download the files locally --> load them into a table in the other project.
When we use the client.extract_table method, we are able to export the data to files in Google Cloud Storage. But we do not have the storage.buckets.create permission, so a bucket cannot be created to export the files into.
Is it possible to extract the data into local files instead?
Approach 2:
Read the table data and write it out as a dataframe using pandas. But that requires the bigquery.readsessions.create permission, which is not available to the service account credentials.
Is there any way to extract the files locally without adding any new permissions?
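For reference, a minimal sketch of what the two approaches above look like with the google-cloud-bigquery client; the key file, table ID, and bucket are placeholders:

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("service-account.json")
table_id = "source-project.source_dataset.source_table"

# Approach 1: export to GCS -- needs an existing bucket, and creating one
# requires storage.buckets.create, which this service account lacks
extract_job = client.extract_table(table_id, "gs://export-bucket/part-*.csv")
extract_job.result()

# Approach 2: read into pandas -- to_dataframe() may use the BigQuery
# Storage API, which needs bigquery.readsessions.create
df = client.list_rows(table_id).to_dataframe()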
Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if files are located in the same place. For example, if your data folder is s3://bucket/data/<files>, then you can add new files to it and run the ETL job - new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need to either run a crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
When data is loaded into a DynamicFrame or Spark's DataFrame, you can apply filters to use only the data you need. If you still want to work with file names, you can add them as a column using the input_file_name Spark function and then apply filtering:
from pyspark.sql.functions import col, input_file_name
df = df.withColumn("filename", input_file_name()) \
       .where(col("filename") == "your-filename")
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue.
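A sketch of what that buys you in a Glue job; the database, table, and partition values are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read from S3
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    push_down_predicate="year = '2018' and month = '02' and day = '13'"
)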
My goal was to duplicate my Google App Engine application. I created a new application and uploaded all the needed code from the source application (Python). Then I uploaded previously created backup files from the Cloud Storage of the source application (first I downloaded those files to my PC and then uploaded them to the GCS bucket of the target app).
After that I tried to restore data from those files by using the "Import Backup Information" button.
The backup information file is found and I can add it to the list of available backups. But when I try to do the restore I receive an error: "There was a problem kicking off the jobs. The error was: Backup not readable".
I also tried to upload those files back to the original application and was able to restore from them using the same procedure, so the files are not corrupted.
I know there are other methods of copying data between applications, but I wanted to use this one. If, for example, my Google account is hacked and I cannot access my original application data, but I still have all the backup data on my hard drive, then I can simply create a new app and copy all the data to it...
Has anyone encountered a similar problem before, and maybe found a solution?
Thanks!
Yes!! What you are trying to do is not possible. The reason is that there are absolute references in the backup files to the original backup location (bucket). So moving the files to another GCS location will not work.
Instead you have to leave the backup files in the original GCS bucket and give your new project read access to that folder. That is done via the "Edit bucket permissions" option, e.g. add:
Project - owners-12345678 - Reader
Now you are able to import from that bucket in your new project via "Import Backup Information".
Given the message, my guess is that the target application has no read access to the bucket where the backup is stored. Add the application to the permitted users of that bucket before creating the backup, so that the backup objects inherit the permission.
Within AWS Glue, how do I deal with files from S3 that change every week?
Example:
Week 1: “filename01072018.csv”
Week 2: “filename01142018.csv”
These files have the same format, but I need Glue to handle the weekly name change in order to load this data into Redshift from S3. The Glue code uses native Python as the backend.
AWS Glue crawlers should be able to just find your CSV files as they are named without any configuration on your part.
For instance, my Kinesis stream produces files that have paths and names that look like these:
my_events_folder/2018/02/13/20/my-prefix-3-2018-02-13-20-18-28-112ab3f0-5794-4f77-9a84-83efafeecabc
my_events_folder/2018/02/13/20/my-prefix-2-2018-02-13-20-12-00-7f2efb62-827b-46a6-83c4-b4c52dd87d60
...
AWS Glue just finds these files and classifies them automatically. Hope this helps.
AWS Glue should be able to process all the files in a folder, irrespective of their names, in a single job. If you don't want old files to be processed again, move them to another location after each run using the boto3 API for S3 (see the sketch below).
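A minimal sketch of that "move after processing" step with boto3; the bucket name and prefixes are placeholders:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"

# Copy each processed file to an archive prefix, then delete the original
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="data/").get("Contents", []):
    key = obj["Key"]
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": key},
                   Key="processed/" + key.split("/")[-1])
    s3.delete_object(Bucket=bucket, Key=key)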
If you have two different TYPES of files (with different internal formats), they must be in separate folder hierarchies. There is no way to tell a crawler to only look for redfile*.csv and ignore bluefile*.csv. Instead use separate hierarchies like:
s3://my-bucket/redfiles/
redfile01072018.csv
redfile01142018.csv
...
s3://my-bucket/bluefiles/
bluefile01072018.csv
bluefile01142018.csv
...
Set up two crawlers, one crawling s3://my-bucket/redfiles/ and the other crawling s3://my-bucket/bluefiles/.
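If you prefer to script that instead of clicking through the console, here is a sketch with boto3; the IAM role and database names are placeholders:

import boto3

glue = boto3.client("glue")

# One crawler per folder hierarchy
for name, path in [("red_crawler", "s3://my-bucket/redfiles/"),
                   ("blue_crawler", "s3://my-bucket/bluefiles/")]:
    glue.create_crawler(
        Name=name,
        Role="MyGlueServiceRole",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": path}]},
    )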