I am porting a Python project (S3 + Athena) from CSV to Parquet.
I can create the Parquet file, which I can view with Parquet View.
I can upload the file to the S3 bucket.
I can create the Athena table pointing to the S3 bucket.
However, when I query the table in the Athena web GUI, it runs for 10 minutes (it seems it will never stop) and no result is shown.
The whole project is complicated, so I will try to simplify the case.
1. Let's say we have the following CSV file (test.csv):
"col1","col2"
"A","B"
2. Then I use the following Python (2.7) code to convert it to a Parquet file (test.parquet):
import fastparquet
import pandas as pd
df = pd.read_csv(r"test.csv")
fastparquet.write(r"test.parquet", df, compression="GZIP")
3. Upload test.parquet to the S3 bucket folder "abc_bucket/abc_folder" via the S3 web GUI.
4. Create the following table via the Athena web GUI:
CREATE EXTERNAL TABLE IF NOT EXISTS abc_folder (
`col1` string,
`col2` string)
STORED AS PARQUET
LOCATION 's3://abc_bucket/abc_folder/'
TBLPROPERTIES (
"parquet.compress"="GZIP"
);
5. Finally, run the following SQL in Athena. The query runs for 10 minutes and seems as if it will never finish:
select *
from abc_folder;
My question is: which of the steps above is wrong, so that I cannot query the table from Athena?
Any help is highly appreciated.
Try to view your Parquet data in the S3 bucket itself with the "Select from" (S3 Select) option.
If it looks fine, then use Athena to create a table for your Parquet file with the proper column headers, and then preview the table to view the content.
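For example, one way to preview the object with S3 Select from code is the minimal boto3 sketch below; the bucket and key names are taken from the question and assumed to be correct, and S3 Select must be available for the object.

import boto3

s3 = boto3.client("s3")
# Peek at the first few rows of the Parquet object with S3 Select
response = s3.select_object_content(
    Bucket="abc_bucket",
    Key="abc_folder/test.parquet",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s LIMIT 5",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))

If this returns the expected rows, the Parquet file itself is readable and the problem is more likely in the table definition.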
We can read a Parquet file in Athena by creating a table for the given S3 location:
CREATE EXTERNAL TABLE abc_new_table (
  dayofweek INT,
  uniquecarrier STRING,
  airlineid INT
)
PARTITIONED BY (flightdate STRING)
STORED AS PARQUET
LOCATION 's3://abc_bucket/abc_folder/'
TBLPROPERTIES ("parquet.compression"="SNAPPY");
This assumes that the s3://abc_bucket/abc_folder/ location contains the Parquet files compressed in SNAPPY format.
More details can be found in this AWS document.
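Because this example table is partitioned, the partitions also have to be registered before a query returns any rows. A minimal boto3 sketch is below; the "default" database, the query-results bucket, and a Hive-style flightdate=... folder layout are assumptions, not details from the question.

import boto3

athena = boto3.client("athena")
# Register the Hive-style partition folders (e.g. flightdate=2022-01-01/) with the table
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE abc_new_table",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://abc_bucket/athena-results/"},
)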
Related
I have a few Parquet files stored in my storage account, which I am trying to read using the code below. However, it fails with a syntax error. Can someone suggest the correct way to read Parquet files using Azure Databricks?
val data = spark.read.parquet("abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet")
display(data)
abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet
As per the abfss URL above, the storage account can hold the data in either Delta or Parquet format.
Note: if you created a Delta table, part files are created automatically with names like part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet. With the code above it is not possible to read a Parquet file that is in Delta format.
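If the files really do belong to a Delta table, a sketch of reading them through the Delta log instead would look like the following; the folder path is inferred from the URL above and is an assumption.

# Read the whole Delta table folder rather than a single part file
data = spark.read.format("delta").load(
    "abfss://containername@storagename.dfs.core.windows.net/TestFolder/XYZ/"
)
display(data)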
I wrote the DataFrame df1 to a storage account in Parquet format with overwrite mode:
df1.coalesce(1).write.format('parquet').mode("overwrite").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<sub_folder>")
Scala
val df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Python
df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
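If the exact part-file name is not known up front, another common pattern is to point Spark at the folder so that every part file is picked up. A sketch, reusing the same placeholder names:

# Read all Parquet part files under the folder (placeholders, not real values)
df_all = spark.read.format("parquet").load(
    "abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<sub_folder>/"
)
display(df_all)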
I am facing the same scenario. Let me explain the situation point by point. I have a Glue ETL job:
1--> My files are in Parquet format and stored in AWS S3.
2--> I iterate in a loop over a data set where the same file name can occur with different data.
3--> I read the Parquet file and save it into a pandas DataFrame (a sketch of this read step is included after the error log below).
4--> I do some operations on that DataFrame.
5--> I upload the updated DataFrame back to the S3 Parquet file. Below is the code snippet I am using to save the updated DataFrame in Parquet format and load it into S3:
import io
import pyarrow
import pyarrow.parquet as parquet

# Build an all-string Arrow schema from the DataFrame's columns
header_name_column_list = dict(data_frame)
header_list = []
for col_id, col_type in header_name_column_list.items():
    header_list.append(pyarrow.field(col_id, pyarrow.string()))
table_schema = pyarrow.schema(header_list)
table = pyarrow.Table.from_pandas(data_frame, schema=table_schema, preserve_index=False)
# Write the table into an in-memory buffer, then upload the buffer to S3
b_buffer = io.BytesIO()
writer = parquet.ParquetWriter(b_buffer, table.schema)
writer.write_table(table)
writer.close()
b_buffer.seek(0)
..... ....
self.s3_client.upload_fileobj(b_buffer, self.bucket, file_key, ExtraArgs=extra_args)
But when I execute the Glue ETL job, it works properly the first time; in the next iteration, when I try to open the same file, I get the error below.
INFO:Iot-dsip-de-duplication-job:Dataframe uploaded: s3://abc/2022/07/12/file1_ft_20220714122108.3065_12345.parquet
INFO:Iot-dsip-de-duplication-job:Sleep for 60 sec
INFO:Iot-dsip-de-duplication-job:start after sleep
..........................
ERROR:Iot-dsip-de-duplication-job:Failed to read data from parquet file s3://abc/2022/07/12/file1_ft_20220714122108.3065_12345.parquet, error is : Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
INFO:Iot-dsip-de-duplication-job:Empty dataframe found
Any clue would be really helpful. I am stuck with this problem.
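For context, the read in step 3 above is roughly the sketch below; the bucket and key are taken from the log, and the rest of the names are placeholders. One common cause of the "magic bytes not found in footer" error is reading an object whose footer was never written, e.g. because the buffer was uploaded before the writer was closed or the object was only partially overwritten.

import io
import boto3
import pyarrow.parquet as pq

s3_client = boto3.client("s3")
# Download the object and parse it back into a pandas DataFrame
obj = s3_client.get_object(Bucket="abc", Key="2022/07/12/file1_ft_20220714122108.3065_12345.parquet")
buffer = io.BytesIO(obj["Body"].read())
data_frame = pq.read_table(buffer).to_pandas()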
I have code running as an AWS Lambda function that queries an internal database and generates files in different formats. The files are generated in parts and uploaded to S3 using multipart upload:
self.mpu = self.s3_client.create_multipart_upload(
Bucket=self.bucket_name,
ContentType=self.get_content_type(),
Expires=self.expire_daytime,
Key=self.filename,
)
and
response = self.s3_client.upload_part(
Bucket=self.bucket_name,
Key=self.filename,
PartNumber=self.part_number,
UploadId=self.upload_id,
Body=data
)
self.current_part_info.update({
'PartNumber': self.part_number,
'ETag': response['ETag']
})
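For completeness, once every part has been uploaded, the multipart upload has to be finalized with the collected PartNumber/ETag pairs. A sketch, where self.parts is an assumed list of the dicts built above:

# Finish the multipart upload so S3 assembles the parts into one object
self.s3_client.complete_multipart_upload(
    Bucket=self.bucket_name,
    Key=self.filename,
    UploadId=self.upload_id,
    MultipartUpload={"Parts": self.parts},
)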
One of the formats I need to support is XLS or XLSX. It's fairly easy to create multiple CSV files on S3. But is it possible to combine them directly on S3 into XLS/XLSX without downloading them?
My current code generates an XLSX file in memory, creates a local file, and then uploads it to S3:
import xlsxwriter

self.workbook = xlsxwriter.Workbook(self.filename)
# download CSV files...
for sheet_name, sheet_info in sheets.items():
    sheet = self.workbook.add_worksheet(name=sheet_name)
    # code that does formatting
    for ...:  # loop through rows
        for ...:  # loop through columns
            sheet.write(row, col, col_str)
self.workbook.close()
This works fine for small queries, but the users will want to use it for a large amount of data.
When I run it with large queries, it runs out of memory. AWS Lambda has limited memory and limited disk space, and I'm hitting those limits.
Is it possible to combine CSV files into XLS or XLSX somehow without holding the entire file in local space (both memory and disk space are a problem)?
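One option that at least keeps memory flat is xlsxwriter's constant_memory mode, which flushes each row to temporary files as it is written, combined with streaming the CSV objects from S3 line by line. A sketch is below; the bucket, keys, sheet mapping, and the use of Lambda's /tmp directory are assumptions, and the finished .xlsx still ends up on disk, so the /tmp size limit still applies.

import csv
import boto3
import xlsxwriter

s3 = boto3.client("s3")
# constant_memory flushes rows to temp files instead of keeping whole sheets in memory
workbook = xlsxwriter.Workbook("/tmp/report.xlsx",
                               {"constant_memory": True, "tmpdir": "/tmp"})
for sheet_name, key in {"sheet1": "exports/part1.csv"}.items():
    sheet = workbook.add_worksheet(name=sheet_name)
    body = s3.get_object(Bucket="my-bucket", Key=key)["Body"]
    # Stream the CSV line by line instead of loading the whole object
    lines = (line.decode("utf-8") for line in body.iter_lines())
    for row, record in enumerate(csv.reader(lines)):
        for col, value in enumerate(record):
            sheet.write(row, col, value)
workbook.close()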
I am creating an external table in Redshift, pointing at a Parquet file stored in S3. The Parquet file is created with pyarrow. When I SELECT * from the external table defined below, the "timestamp" column works, but "anonymous_id" (VARCHAR) is null. The same applies to every VARCHAR column.
CREATE EXTERNAL TABLE
propensity_identify
(anonymous_id VARCHAR(max),
timestamp timestamp without time zone)
PARTITIONED BY (loaded_at timestamp)
STORED AS PARQUET
LOCATION 's3://bucket/key'
TABLE PROPERTIES ('compression'='none', 'serialization.null.format'='')
The parquet schema is:
anonymousId: BYTE_ARRAY UTF8
timestamp: INT96
Any idea why that happens? STL_S3CLIENT_ERROR says:
S3ServiceException:HTTP/1.1 403 Forbidden,Status 403
Thank you very much for your help!
I am trying to load some Redshift query results to S3. So far I am using pandas_redshift but I got stuck:
import pandas_redshift as pr

pr.connect_to_redshift(dbname='dbname',
                       host='xxx.us-east-1.redshift.amazonaws.com',
                       port=5439,
                       user='xxx',
                       password='xxx')

pr.connect_to_s3(aws_access_key_id='xxx',
                 aws_secret_access_key='xxx',
                 bucket='dxxx',
                 subdirectory='dir')
And here is the data that I want to dump to S3:
sql_statement = '''
select
provider,
provider_code
from db1.table1
group by provider, provider_code;
'''
df = pr.redshift_to_pandas(sql_statement)
The df was created successfully, but how do I do the next step, which is to put this DataFrame into S3?
The method you are looking at is very inefficient.
To do this the right way, you will need a way to run SQL on Redshift, e.g. via Python.
The following SQL should be run:
unload ('select provider,provider_code
from db1.table1
group by provider, provider_code;')
to 's3://mybucket/myfolder/unload/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
See here for documentation.
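A minimal sketch of running that UNLOAD from Python, here using psycopg2; the host, credentials, and bucket are the same placeholders as above, not real values.

import psycopg2

conn = psycopg2.connect(host="xxx.us-east-1.redshift.amazonaws.com",
                        port=5439,
                        dbname="dbname",
                        user="xxx",
                        password="xxx")
unload_sql = """
    unload ('select provider, provider_code
             from db1.table1
             group by provider, provider_code')
    to 's3://mybucket/myfolder/unload/'
    access_key_id '<access-key-id>'
    secret_access_key '<secret-access-key>'
"""
# Run the UNLOAD; Redshift writes the result files directly to S3
with conn, conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()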
As Jon Scott mentions, if your goal is to move data from Redshift to S3, then the pandas_redshift package is not the right method. The package is meant to let you easily move data from Redshift to a pandas DataFrame on your local machine, or move data from a pandas DataFrame on your local machine to Redshift. It is worth noting that running the command you already have:
df = pr.redshift_to_pandas(sql_statement)
pulls the data directly from Redshift to your computer without involving S3 at all. However, this command:
pr.pandas_to_redshift(df, 'schema.your_new_table_name')
copies the DataFrame to a CSV in S3 and then runs a query to copy that CSV into Redshift (this step requires that you ran pr.connect_to_s3 successfully). It does not perform any cleanup of the S3 bucket, so a side effect is that the data will end up in the bucket you specify.