Within AWS Glue, how do I deal with files from S3 whose names change every week?
Example:
Week 1: “filename01072018.csv”
Week 2: “filename01142018.csv”
These files are set up in the same format, but I need Glue to adjust each week so it can load this data from S3 into Redshift. The Glue code uses native Python as the backend.
AWS Glue crawlers should be able to just find your CSV files as they are named without any configuration on your part.
For instance, my Kinesis stream produces files that have paths and names that look like these:
my_events_folder/2018/02/13/20/my-prefix-3-2018-02-13-20-18-28-112ab3f0-5794-4f77-9a84-83efafeecabc
my_events_folder/2018/02/13/20/my-prefix-2-2018-02-13-20-12-00-7f2efb62-827b-46a6-83c4-b4c52dd87d60
...
AWS Glue just finds these files and classifies them automatically. Hope this helps.
AWS Glue should be able to process all the files in a folder in a single job, irrespective of their names. If you don't want old files to be processed again, move them to another location with the boto3 S3 API after each run.
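For example, here is a minimal boto3 sketch of that "move" step. S3 has no rename operation, so the object is copied to an archive prefix and the original is deleted; the bucket and key names below are made up for illustration.

import boto3

s3 = boto3.resource("s3")

# Hypothetical bucket and key names
bucket = "my-bucket"
src_key = "incoming/filename01072018.csv"
dst_key = "processed/filename01072018.csv"

# S3 has no native "move": copy to the new prefix, then delete the original
s3.Object(bucket, dst_key).copy_from(CopySource={"Bucket": bucket, "Key": src_key})
s3.Object(bucket, src_key).delete()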
If you have two different TYPES of files (with different internal formats), they must live in separate folder hierarchies. There is no way to tell a crawler to only look for redfile*.csv and ignore bluefile*.csv. Instead, use separate hierarchies like:
s3://my-bucket/redfiles/
redfile01072018.csv
redfile01142018.csv
...
s3://my-bucket/bluefiles/
bluefile01072018.csv
bluefile01142018.csv
...
Set up two crawlers: one crawling s3://my-bucket/redfiles/ and the other crawling s3://my-bucket/bluefiles/.
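If you want to script that setup rather than click through the console, a minimal boto3 sketch might look like this; the crawler names, IAM role ARN, and catalog database name are assumptions for illustration.

import boto3

glue = boto3.client("glue")

# Hypothetical role ARN and catalog database name
role_arn = "arn:aws:iam::123456789012:role/MyGlueCrawlerRole"
database = "my_database"

# One crawler per folder hierarchy
for name, path in [("red-crawler", "s3://my-bucket/redfiles/"),
                   ("blue-crawler", "s3://my-bucket/bluefiles/")]:
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": path}]},
    )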
Related
I just linked an Azure storage account (Storage gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers each with some subdirectories. Inside the folders are .csv files.
I have connected an Azure service principal with Azure Blob Data Contributor access to the storage account inside Databricks, so I can read and write to the storage account.
I am trying to figure out the best way to convert the existing storage account into a delta lake (tables registered in the metastore, plus the files converted to Parquet/Delta tables).
What is the easiest way to do that?
My naive approach as a beginner might be
Read the folder using
spark.read.format("csv").load("{container}#{storage}..../directory")
Write to a new folder with a similar name (so if the folder is directory, write it to directory_parquet) using df.write.format("delta").save("{container}#{storage}.../directory_parquet")
I'm not sure about the last steps. This would create a new folder with a new set of files, and I do get Parquet files, but it wouldn't be a table in Databricks that shows up in the Hive metastore.
Alternatively, I can use df.write.format("delta").saveAsTable("tablename"), but that doesn't create the table in the storage account; it creates it inside the Databricks file system, although it does show up in the Hive metastore.
Then delete the existing data files if desired (or leave the data duplicated).
Preferably this can be done in a Databricks notebook using Python, or Scala/SQL if necessary.
*As a possible fallback, if the effort to do this properly is monumental: just convert to Parquet and register table information for each subfolder in the Hive metastore, in the form database=containerName, tableName=subdirectoryName.
The folder structure is pretty flat at the moment, so only rootcontainer/Subfolders deep.
Perhaps an external table is what you're looking for:
df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename")
This post has more info on external tables vs managed tables.
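Tying that to the per-subfolder layout described in the question, a minimal PySpark sketch might look like the following; the abfss:// path, container, storage account, and folder names are assumptions for illustration.

# Hypothetical ADLS Gen2 location; adjust container and storage account names
base = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net"
folders = ["folder_a", "folder_b"]  # the flat set of subdirectories

spark.sql("CREATE DATABASE IF NOT EXISTS mycontainer")

for folder in folders:
    df = spark.read.format("csv").option("header", "true").load(f"{base}/{folder}")
    # External table: the data stays in the storage account, only metadata goes to the metastore
    (df.write.format("delta")
        .option("path", f"{base}/{folder}_delta")
        .saveAsTable(f"mycontainer.{folder}"))

After that, the original CSV folders can be removed if you don't want the data duplicated.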
I would like to develop a web app that parses locally stored data and allows users to create a sorted Excel file.
It would be amazing if I could somehow avoid uploading the files. Some users are worried about their data, and the files can get really big, so I would have to implement async processing and so on.
Is something like that possible?
I am trying to upload a file to a subfolder in S3 from a Lambda function.
Any suggestions for achieving this? Currently I am able to upload only to the top level of the S3 bucket.
s3_resource.Bucket("bucketname").upload_file("/tmp/file.csv", "file.csv")
However, my goal is to upload to bucketname/subfolder1/file.csv.
Thanks in advance
Amazon has this to say about S3 paths:
In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. It does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.
In other words, you just need to specify the full key you want to use for the upload. Any directory concept only affects how objects are enumerated and displayed; there isn't a directory you need to create as in a traditional filesystem:
s3_resource.Bucket("bucketname").upload_file("/tmp/file.csv", "subfolder1/file.csv")
Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, new files arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.
When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?
Thanks!
You don't need to re-run the crawler if files are located in the same place. For example, if your data folder is s3://bucket/data/<files>, then you can add new files to it and run the ETL job; the new files will be picked up automatically.
However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files>, then you need to either run a crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
When data is loaded into a DynamicFrame or Spark DataFrame, you can apply filters so that only the needed data is used. If you still want to work with file names, you can add them as a column using the input_file_name Spark function and then apply filtering:
from pyspark.sql.functions import col, input_file_name

df = (df
      .withColumn("filename", input_file_name())  # full path of the source file
      .where(col("filename") == "your-filename"))
If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue, as sketched below.
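For instance, here is a minimal sketch of a Glue job that reads only one day's partition via a pushdown predicate; the catalog database, table, and partition column names are assumptions for illustration.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table; partition columns year/month/day are assumed
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="data",
    push_down_predicate="year='2018' and month='02' and day='13'",
)

Only the matching S3 prefixes are listed and read, instead of the whole table.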
I followed the S3 documentation and example for uploading my large file as a multipart upload to Amazon S3 storage using the boto3 library. However, the information in the official documentation was not sufficient for me. In my case, I have multiple storages in different regions, and I need to upload my parts (500 GB overall) to different storages. I would like to see an example of how to upload multiple parts to different S3 storages in different zones using the boto3 library. Any information is valuable to me. Thank you for reading.
boto3 documentation
boto3 example
Okay, you seem to be confused about some of the terminology. Let's define some terms:
Region: A geographic location containing multiple Availability Zones, eg London, Ohio
Availability Zone: A data centre (or multiple). Each AZ in a region is geographically separated to ensure High Availability.
Regional service: Some services, including Amazon S3, are regional. This means the service automatically copies data amongst the AZs in a region.
Bucket: Storage location for objects in Amazon S3. A bucket exists in only one region.
Object: A file in a bucket.
Now, to answer your questions:
An object exists in one bucket only. An object cannot be split between buckets. (And since a bucket exists in a region, it cannot be split between regions either.)
If you wish to replicate objects between regions, use Cross-Region Replication. This automatically copies objects from one bucket to another.
The maximum size of an object in Amazon S3 is 5TB, but you really, really don't want to get that large. Most big data applications use many smaller files (eg 5MB in size). This allows parallel loading across multiple processes, which is commonly done in Hadoop. It also allows the addition of new data by simply adding files, rather than updating an existing file. (By the way, you cannot append to an S3 object, you can only replace it.)
The easiest way to upload data to S3 is by using the AWS Command-Line Interface (CLI).
Multi-part uploads are merely a means of uploading a single object by splitting it into multiple parts, uploading each part, then stitching them together (see the sketch at the end of this answer). The method of upload is unrelated to the actual object once it has been uploaded.
You should always store your data close to where it is being processed. So, only replicate data between regions if it needs to be processed in multiple regions.
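As an illustrative sketch only (the file, bucket names, and regions below are made up): with boto3, upload_file performs the multipart upload for you once the file exceeds the configured threshold, so you rarely need to manage the parts yourself. Uploading to buckets in different regions is just a matter of using one client per region.

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart kicks in above multipart_threshold; parts are uploaded by parallel threads
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

# Hypothetical file, buckets, and regions
s3_eu = boto3.client("s3", region_name="eu-west-2")
s3_eu.upload_file("/data/large_file.bin", "my-bucket-eu-west-2",
                  "uploads/large_file.bin", Config=config)

s3_us = boto3.client("s3", region_name="us-east-2")
s3_us.upload_file("/data/large_file.bin", "my-bucket-us-east-2",
                  "uploads/large_file.bin", Config=config)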