Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, my issue is that new files will arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if the files land in the same place. For example, if your data folder is s3://bucket/data/<files>, then you can add new files to it and run the ETL job; the new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need to either run the crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
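If you'd rather trigger that repair from Python than from the Athena console, a minimal boto3 sketch could look like this (the table, database and results bucket names are just placeholders):

import boto3

athena = boto3.client("athena")

# Register newly added partitions; database, table and output bucket are placeholders.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_catalog_table",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)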
When the data is loaded into a DynamicFrame or Spark DataFrame, you can apply filters to use only the data you need. If you still want to work with file names, you can add the name as a column using the input_file_name Spark function and then filter on it:
from pyspark.sql.functions import col, input_file_name

df = (df
      .withColumn("filename", input_file_name())   # input_file_name() adds the source file path as a column
      .where(col("filename") == "your-filename"))  # keep only rows coming from the desired file
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue.
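For illustration, here is a hedged sketch of reading only one day's partition with a pushdown predicate in a Glue job (the database, table and partition column names are placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only a single day's partition instead of scanning the whole table.
# Database, table and partition column names are placeholders.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="my_catalog_table",
    push_down_predicate="year = '2018' and month = '02' and day = '13'",
)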
I just linked an Azure storage account (Storage Gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers, each with some subdirectories; inside those folders are .csv files.
I have connected an Azure service principal with Azure Blob Data Contributor access to the storage account inside Databricks, so I can read and write to the storage account.
I am trying to figure out the best way to convert the existing storage account into a delta lake (tables registered in the metastore, plus the files converted to Parquet/Delta tables).
What is the easiest way to do that?
My naive approach as a beginner might be
Read the folder using
spark.read.format("csv).load("{container}#{storage}..../directory)
Write to a new folder with similar name (so if folder is directory, write it to directory_parquet) using df.write.format("delta").save({container}#{storage}.../directory_parquet)
And then I'm not sure about the last steps. This would create a new folder with a new set of files, but it wouldn't be a table in Databricks that shows up in the Hive metastore, even though I do get Parquet files.
Alternatively, I can use df.write.format("delta").saveAsTable("tablename"), but that creates the table inside the Databricks file system rather than in the storage account, although it does show up in the Hive metastore.
Delete the existing data files if desired (or leave them duplicated).
Preferably this can be done in a Databricks notebook using Python, or Scala/SQL if necessary.
As a possible fallback, if the effort to do this properly is monumental, just converting to Parquet and registering table information for each subfolder in the Hive metastore would do, using database = containerName and tableName = subdirectoryName.
The folder structure is pretty flat at the moment, only rootcontainer/Subfolders deep.
Perhaps an external table is what you're looking for:
df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename")
This post has more info on external tables vs managed tables.
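If you want to apply that across your flat folder layout, a rough sketch could look like the following (the storage account, container and database names are placeholders, and it assumes you run it in a Databricks notebook where spark and dbutils are already available):

# Convert each CSV subfolder into an external Delta table registered in the metastore.
# Storage account, container and database names below are placeholders.
container = "rootcontainer"
base_path = f"abfss://{container}@mystorageaccount.dfs.core.windows.net"

spark.sql(f"CREATE DATABASE IF NOT EXISTS {container}")

for folder in dbutils.fs.ls(base_path):
    if not folder.isDir():
        continue
    table_name = folder.name.rstrip("/")
    df = spark.read.format("csv").option("header", "true").load(folder.path)
    (df.write.format("delta")
       .mode("overwrite")
       .option("path", f"{base_path}/{table_name}_delta")  # external location in the container
       .saveAsTable(f"{container}.{table_name}"))          # shows up in the Hive metastore

Each table's data then stays in the storage account as Delta/Parquet files while still appearing in the metastore, which matches the database=containerName, tableName=subdirectoryName layout you described.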
Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files, but I need to create a folder for every nomenclature value (a value in one of the table's columns) and then create the multiple files inside it.
The table is pretty huge, 1+ TB, and won't fit in RAM (it could, but that's not the idea). I cannot naively loop over it with Python.
I have read quite a lot of documentation and gone through the parameters, but I can't find a clean solution. Did I miss something, or is there no Google-provided solution?
My plan B is to use Apache Beam and Dataflow, but I don't have those skills yet, and I would like to avoid that route as much as possible for simplicity and maintenance.
You have 2 solutions:
Create one export query per aggregation. If you have 100 nomenclature values, query the table 100 times and export the data to the corresponding target directory (a sketch of this approach follows below). The issue is cost: you will pay for 100 full scans of the table.
You can use Apache Beam to extract and group the data. Then, with dynamic destinations, you can write to whatever GCS paths you want. The issue is that it requires Apache Beam skills to achieve.
There is an extra solution, similar to the second one, where you use Spark, especially serverless Spark, to achieve it. If you have more Spark skill than Apache Beam skill, it could be more efficient.
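For the first solution, a minimal sketch with the BigQuery Python client and the EXPORT DATA statement could look like this (the project, dataset, table, bucket and "nomenclature" column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: project, dataset, table, bucket and the "nomenclature" column.
table = "my-project.my_dataset.my_table"
bucket = "my-export-bucket"

# One export query per nomenclature value; each one pays for a scan of the table.
values = [row.nomenclature for row in
          client.query(f"SELECT DISTINCT nomenclature FROM `{table}`").result()]

for value in values:
    export_sql = f"""
    EXPORT DATA OPTIONS(
      uri='gs://{bucket}/{value}/export-*.csv',
      format='CSV',
      overwrite=true,
      header=true)
    AS SELECT * FROM `{table}` WHERE nomenclature = '{value}'
    """
    client.query(export_sql).result()  # wait for each export job to finish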
Currently I am building a data pipeline where I want to copy data from Blob Storage to Azure Data Lake through Azure Data Factory. Before the copy, I want a file check of sorts: the pipeline should check the directory for the file (for example, a CSV file), and if it is present, start copying to ADLS; otherwise it should throw a "file not found" error. I know we can do this in Python, but how do I add that check to an ADF pipeline? Any help will be appreciated.
I would use the Get Metadata activity to get a list of all items in your blob storage (select your blob as the dataset):
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
Then you might need to check whether an item is a file, not a folder. For that you can add a combination of the "ForEach" and "If Condition" activities. You can refer to the items from the Get Metadata step using the @activity('GetMetadata').output.childItems expression, and use @equals(item().type, 'File') to check whether each item is a file.
Basically I want a SQL connection to a CSV file in an S3 bucket using Amazon Athena. I don't know anything about the file other than that the first row gives the column headers. Does anyone know a solution for this?
You have at least two ways of doing this. One is to examine a few rows of the file to detect the data types, then write a CREATE TABLE SQL statement as shown in the Athena docs.
If you know you are getting only strings and numbers (for example) and if you know all the columns will have values, it can be relatively easy to build it that way. But if types can be more flexible or columns can be empty, building a robust solution from scratch might be tricky.
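As a rough illustration of that first option, here is a hedged sketch that submits a CREATE EXTERNAL TABLE statement through boto3 (the column names and types are pure assumptions, since only the header row is known, and the bucket paths are placeholders):

import boto3

athena = boto3.client("athena")

# Hypothetical columns and locations - adjust to whatever the header row actually contains.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_csv_table (
  id     string,
  name   string,
  amount double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
LOCATION 's3://my-bucket/path/to/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)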
So the second option would be to use the AWS Glue Catalog to define a crawler, which does exactly what I told you above, but automatically. It also creates the metadata you need in Athena, so you don't need to write the CREATE TABLE statement.
As a bonus, you can use that automatically catalogued data not only from Athena, but also from Redshift and EMR. And if you keep adding new files to the same bucket (every day, every hour, every week...) you can tell the crawler to run again and rediscover the data in case the schema has evolved.
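If you prefer to define the crawler programmatically instead of in the console, a hedged boto3 sketch could be (the crawler name, IAM role, database and S3 path are placeholders):

import boto3

glue = boto3.client("glue")

# Placeholder crawler name, IAM role, Glue database and S3 path.
glue.create_crawler(
    Name="my-csv-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/path/to/csv/"}]},
    Schedule="cron(0 1 * * ? *)",  # optional: re-crawl daily to pick up schema changes
)

glue.start_crawler(Name="my-csv-crawler")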
Within AWS Glue, how do I deal with files from S3 that will change every week?
Example:
Week 1: “filename01072018.csv”
Week 2: “filename01142018.csv”
These files are set up in the same format, but I need Glue to be able to handle the changing file name each week when loading this data from S3 into Redshift. The code for Glue uses native Python as the backend.
AWS Glue crawlers should be able to find your CSV files just as they are named, without any extra configuration on your part.
For instance, my Kinesis stream produces files that have paths and names that look like these:
my_events_folder/2018/02/13/20/my-prefix-3-2018-02-13-20-18-28-112ab3f0-5794-4f77-9a84-83efafeecabc
my_events_folder/2018/02/13/20/my-prefix-2-2018-02-13-20-12-00-7f2efb62-827b-46a6-83c4-b4c52dd87d60
...
AWS Glue just finds these files and classifies them automatically. Hope this helps.
AWS Glue should be able to process all the files in a folder in a single job, irrespective of their names. If you don't want the old files to be processed again, move them to another location using the boto3 API for S3 after each run.
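A minimal sketch of that move step, assuming hypothetical bucket and prefix names (S3 has no native move, so it is a copy followed by a delete):

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefixes.
bucket = "my-bucket"
source_prefix = "incoming/"
archive_prefix = "processed/"

# Move every object under the source prefix to the archive prefix after a successful run.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=source_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        new_key = archive_prefix + key[len(source_prefix):]
        s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)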
If you have two different TYPES of files (with different internal formats), they must be in separate folder hierarchies. There is no way to tell a crawler to only look for redfile*.csv and ignore bluefile*.csv. Instead, use separate hierarchies like:
s3://my-bucket/redfiles/
redfile01072018.csv
redfile01142018.csv
...
s3://my-bucket/bluefiles/
bluefile01072018.csv
bluefile01142018.csv
...
Set up two crawlers, one crawling s3://my-bucket/redfiles/ and the other crawling s3://my-bucket/bluefiles/.