I have tried everything I can find and think of, but cannot seem to get this code right.
I'm using Airflow and trying to run a SQL SELECT statement, return the results, and upload them directly to S3 from a python_callable task.
I am unable to save the DataFrame as a CSV locally, so that is not an option.
Ultimately, I keep looping back to ERROR - Fileobj must implement read. The only "successful" attempts have produced empty files in S3. I tried the .seek(0) approach I found in another post, but then I got ERROR - Unicode-objects must be encoded before hashing. My code is below. Any direction is enormously appreciated.
snow_hook = SnowflakeHook(
    snowflake_conn_id='Snowflake_ETL_vault'
)
df = snow_hook.get_pandas_df(sql=sql)

with io.StringIO() as stream:
    df.to_csv(stream)
    stream.seek(0)
    f = stream.getvalue()

s3_hook = S3Hook(aws_conn_id='s3_analytics')
s3_hook.load_file_obj(
    file_obj=f,
    bucket_name=bkt,
    key=key,
    replace=True
)
Edit: I have also tried f = stream.read() and still somehow get Fileobj must implement read.
Thanks!
I also faced the same issue and spent some time understanding its nature.
The reason you are getting ERROR - Fileobj must implement read is that file_obj expects the stream object itself, not stream.getvalue().
Also, pandas.to_csv has some encoding issues; you can find the details here:
https://github.com/pandas-dev/pandas/issues/23854
The workaround is to write bytes using the load_bytes function from S3Hook:
with io.BytesIO() as buffer:
    buffer.write(
        bytes(
            df.to_csv(None, sep="|", quotechar='"'),
            encoding="utf-8"
        )
    )
    hook.load_bytes(
        buffer.getvalue(),
        bucket_name="bucket_name",
        key="keyname.csv",
        replace=True
    )
I am still looking for a better solution, though.
You can also do it with the load_string method:
df = snow_hook.get_pandas_df(sql=sql)
csv_data_as_str = df.to_csv(index=False)
s3_hook.load_string(string_data=csv_data_as_str, key=s3_key, bucket_name=s3_bucket, replace=True)
Related
I am trying to decompress a file located on S3 from a Lambda (Python). Everything works great until the original file is over 2 GB; at that point, I only get "That compression method is not supported".
I have mainly tried ZipFile and ZipFile39 with no luck; there are a couple of other packages, but I got similar results.
When I try to unzip the content using ZipFile:
# Using ZipFile -- No problems here... (I have tried with/without
# compression, compresslevel, allowZip64 as well as different values and
# same output)
zip_content = zipfile.ZipFile(zip_content,
                              'r',
                              compression=8,  # There is no 9 here
                              compresslevel=9,
                              allowZip64=True)
for filename in zip_content.namelist():
    print(zip_content.getinfo(filename))

# Printing the zip_info I get: <ZipInfo filename='test_file.csv'
# compress_type=deflate64 external_attr=0x20 file_size=2505399449
# compress_size=853276056>
When I try to unzip the content using ZipFile39:
# Using ZipFile39 -- No problems here... (I have tried with/without compression,
# compresslevel, allowZip64 as well as different values and same output)
# But something interesting is that I cannot use 'ZIP_DEFLATED64' as it says that
# attribute cannot be found on ZipFile39, but it is there (using 9 also fails).
zip_content = zipfile39.ZipFile(zip_content,
                                'r',
                                compression=9,
                                compresslevel=9,
                                allowZip64=True)
for filename in zip_content.namelist():
    print(zip_content.getinfo(filename))

# Printing the zip_info I get: <ZipInfo filename='test_file.csv'
# compress_type=deflate64 external_attr=0x20 file_size=2505399449
# compress_size=853276056>
The exception comes when trying:
# Writing to S3 <<---- HERE is where the exception occurs, always:
# "That compression method is not supported"
# That is happening on the "open" zip_content.open (I have tried with the
# force_zip64 and without it)
zip_content.open(zip_info, force_zip64=True)
I have seen other questions about the same topic, but I haven't found an answer I can make work so far. I have tried zipfile-deflate64 (got a cyclical reference error), stream_inflate (did not decompress the file), and stream_unzip (did not decompress the file).
A few important notes:
This is happening only on the AWS Lambda (locally it works without issues, on a Windows laptop).
It is not about memory/space on the Lambda; no more than 20% is used at this point.
Any idea/help/suggestion will be appreciated.
Thanks
With the following code:
import pandas as pd

filename = r"/path/to/my/file.csv"
with open(filename) as f:
    data_frame = pd.read_csv(f,
                             usecols=['col1', 'col2'],
                             parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                             header=0,
                             delimiter=';',
                             encoding='latin-1')
print(data_frame)
When this is executed locally, it prints the expected dataframe with proper accents. When executed in an Airflow task on a remote worker, it fails with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2581: invalid continuation byte
When looking at the full debug stack trace (sorry, sensitive information, can't fully provide), I see within the stack trace that encoding='latin-1' was definitely specified, yet it still fails with the UnicodeDecodeError. Has anybody noticed a similar issue? I've been running in circles, trying as many encodings as possible, but nothing seems to work.
I forgot to mention that the file is a remote file on a Samba share. Whether I try to read it directly with smbclient.open() or copy it over locally and then open it, I get the same result: UnicodeDecodeError. When I try the same thing locally (both reading directly from the Samba share and copying it over), everything seems fine, and I noticed I don't even need to specify the encoding; it finds it automatically and accents are displayed properly.
Another update: it seems that whether the file is read from the Samba share or not does not make a difference. I managed to run the Docker image that is used on the remote worker, and I can reproduce this issue with everything hosted locally, whether I open the file beforehand, whether I read it entirely before giving it to pandas, or whether I simply give the filename to read_csv.
The engine does not seem to make a difference either: specifying engine='python' or engine='c' yields the same results.
Another update: it appears that this same issue also happens with a fresh Ubuntu Docker image. I'm guessing there are some locales that need to be installed before it is able to parse them.
I've figured it out.
On a Windows machine, the default encoding seems to be different; I don't even have to specify an encoding for it to work. Not so in the container. Thus, I need to specify the encoding when opening files in the containers. The following should work:
import pandas as pd

filename = r"/path/to/my/file.csv"
with open(filename, encoding='latin-1') as f:
    data_frame = pd.read_csv(f,
                             usecols=['col1', 'col2'],
                             parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                             header=0,
                             delimiter=';')
    # Notice the lack of encoding='latin-1' here.
print(data_frame)
But! SambaHook essentially returns a pysmbclient SambaClient. When you try to open a file with this SambaClient, there is no way to specify the encoding of the file. So, locally, on a Windows machine, everything seems to work fine, while in a Linux container it fails with the UnicodeDecodeError. Looking under the hood, I've found that it essentially copies the file over before simply calling open() on it.
For now, this is my solution: copy the file over with the SambaClient returned by SambaHook into a temp file, open it with the proper encoding, and ask pandas to parse it. I will see what I can do about improving SambaHook and pysmbclient so that others can specify the encoding when opening a file.
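A rough sketch of that workaround, under the assumption that the client returned by SambaHook exposes a download(remote_path, local_path) method and that the import path matches older Airflow versions; the connection id and paths below are placeholders:

import tempfile

import pandas as pd
from airflow.hooks.samba_hook import SambaHook

samba_hook = SambaHook(samba_conn_id='smb_default')  # placeholder connection id
client = samba_hook.get_conn()                       # the underlying SambaClient

with tempfile.NamedTemporaryFile(suffix='.csv') as tmp:
    # Copy the remote file into a local temp file first; download() is an
    # assumption here -- adjust to whatever your client actually exposes.
    client.download('/share/path/to/file.csv', tmp.name)
    # Now we control the encoding when opening the local copy.
    with open(tmp.name, encoding='latin-1') as f:
        data_frame = pd.read_csv(f,
                                 usecols=['col1', 'col2'],
                                 parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                                 header=0,
                                 delimiter=';')
print(data_frame)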
Apologies in advance since I'm new to Django (and I've also had to freshen up my Python skills). I'm trying to make a simple example of uploading a file through a form and then printing the rows in my terminal (as a test before doing some actual processing). My views.py contains the following:
def upload_csv(request):
    if "GET" == request.method:
        return render(request, "portal/csvupload.html")
    csv_file = request.FILES["csv_file"]
    handle_files(csv_file)
    return HttpResponseRedirect(reverse("portal:csvupload"))


def handle_files(csvfile):
    csv_reader = csv.reader(csvfile)
    for line in csv_reader:
        print(line)
Now this returns an error message saying "expected str, bytes or os.PathLike object, not InMemoryUploadedFile", and I'm unsure what's wrong with the code based on the error message. From a Python perspective it looks fine, I think, but perhaps it's something to do with the redirect? I appreciate all answers.
request.FILES["csv_file"] is returning an InMemoryUploadedFile object and csv.reader does not know how to handle such an object. I believe you need to call the object's read method: handle_files(csv_file.read()). Note the warning in the documentation: "Be careful with this method: if the uploaded file is huge it can overwhelm your system if you try to read it into memory. You’ll probably want to use chunks() instead; see below."
I am trying to read only one file from a tar.gz file. All operations over the tarfile object work fine, but when I read a concrete member, a StreamError is always raised. Check this code:
import tarfile

fd = tarfile.open('file.tar.gz', 'r|gz')
for member in fd.getmembers():
    if not member.isfile():
        continue
    cfile = fd.extractfile(member)
    print cfile.read()
    cfile.close()
fd.close()
cfile.read() always causes "tarfile.StreamError: seeking backwards is not allowed"
I need to read the contents into memory, not dump them to a file (extractall works fine).
Thank you!
The problem is this line:
fd = tarfile.open('file.tar.gz', 'r|gz')
You don't want 'r|gz', you want 'r:gz'.
If I run your code on a trivial tarball, I can even print out the member and see test/foo, and then I get the same error on read that you get.
If I fix it to use 'r:gz', it works.
From the docs:
mode has to be a string of the form 'filemode[:compression]'
...
For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file… Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow to be accessed randomly, see Examples.
'r|gz' is meant for when you have a non-seekable stream, and it only provides a subset of the operations. Unfortunately, it doesn't seem to document exactly which operations are allowed—and the link to Examples doesn't help, because none of the examples use this feature. So, you have to either read the source, or figure it out through trial and error.
But, since you have a normal, seekable file, you don't have to worry about that; just use 'r:gz'.
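Put together, a minimal sketch of the fix, with the same logic as the question but opened in ':' mode (written for Python 3):

import tarfile

# 'r:gz' opens a seekable archive; 'r|gz' is stream mode and forbids
# seeking backwards, which is what extractfile().read() triggers here.
with tarfile.open('file.tar.gz', 'r:gz') as fd:
    for member in fd.getmembers():
        if not member.isfile():
            continue
        cfile = fd.extractfile(member)
        print(cfile.read())   # reads the member's contents into memory
        cfile.close()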
In addition to the file mode issue, I was attempting to seek on a network stream.
I had the same error when trying to requests.get the file, so I extracted all to a tmp directory:
# stream is the response from requests.get (a non-seekable network stream)
import os
import tarfile
from lzma import LZMAFile

inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    tarfileobj.extractall(path=t, members=None)

for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())
I have followed the tutorial posted here in order to get AJAX file uploads on my Django app. The thing is that it doesn't work, and the closest I could get to the issue is finding out that the save_upload() method raises the following exception: 'WSGIRequest' object has no attribute 'read'. Any ideas on what I am doing wrong?
EDIT: I figured out that this only works in Django 1.3. Any ideas on how to make it work in Django 1.2?
I think I have gotten to the bottom of your problem.
1) You are trying to run .read() on a request object, which is not allowed. Instead, you need to run it on request.raw_post_data.
2) Since .read() is only available on file-like objects, you first need to convert the str into one.
Try this:
import StringIO

output = StringIO.StringIO()
output.write(request.raw_post_data)
output.seek(0)  # rewind to the start, otherwise read() returns nothing
...now you'll be able to run output.read() and get the data you want.
# loop through, writing more of the file each time
# (dest is whatever destination file object you are writing to)
file_so_far = output.read(1024)   # Get ready....
while file_so_far:                # ..get set...
    dest.write(file_so_far)       # Go.
    file_so_far = output.read(1024)