For some reason, when I attempt to read an HDF file from S3 using the pandas.read_hdf() method with an s3 URL, I get a FileNotFoundError. The file definitely exists, and I have tried the pandas.read_csv() method with a CSV file in the same S3 directory, which works. Is there something else I need to be doing? Here's the code:
import boto3
import h5py
import s3fs
import pandas as pd
csvDataframe = pd.read_csv('s3://BUCKET_NAME/FILE_NAME.csv')
print("Csv data:")
print(csvDataframe)
dataframe = pd.read_hdf('s3://BUCKET_NAME/FILE_NAME.h5', key='df')
print("Hdf data:")
print(dataframe)
Here is the error:
FileNotFoundError: File s3://BUCKET_NAME/FILE_NAME.h5 does not exist
In the actual code, BUCKET_NAME and FILE_NAME are replaced with their actual strings.
Please make sure the file extension is .h5.
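If the extension is correct and the error persists, a workaround is to download the object to a local path first, since read_hdf (via PyTables) generally needs a local file rather than an s3:// URL. A minimal sketch, assuming boto3 credentials are configured and the dataset was written under the key 'df' as in the question; the /tmp path is just a placeholder:
import boto3
import pandas as pd

s3 = boto3.client('s3')
# copy the object from S3 to a local temporary file
s3.download_file('BUCKET_NAME', 'FILE_NAME.h5', '/tmp/FILE_NAME.h5')
# read the local copy with the same key the file was written with
dataframe = pd.read_hdf('/tmp/FILE_NAME.h5', key='df')
print(dataframe)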
I have created a script which dumps the Excel sheets stored in S3 into my local Postgres database. I've used the pandas read_excel and ExcelFile methods to read the Excel sheets.
The code for this is below.
import boto3
import pandas as pd
import io
import os
from sqlalchemy import create_engine
import xlrd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxxxxxxxx"
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket-name', Key='file.xlsx')
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
print(data.sheet_names)
a = len(data.sheet_names)
engine1 = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
for i in range(a):
    df = pd.read_excel(io.BytesIO(obj['Body'].read()), sheet_name=data.sheet_names[i], engine='openpyxl')
    df.to_sql("test" + str(i), engine1, index=False)
Basically, the code reads the file from the S3 bucket and runs in a loop. For each sheet, it creates a table
and dumps the data from that sheet into the table.
Where I'm having trouble is that when I run this code, I get this error:
df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i-1], engine='openpyxl')
zipfile.BadZipFile: File is not a zip file
This is coming after I added the 'openpyxl' engine in the read_excel method. When I remove the engine, I get this error:
raise ValueError(
ValueError: Excel file format cannot be determined, you must specify an engine manually.
Please note that I can print the connection to the database, so there is no problem with connectivity, and I'm using the latest versions of Python and pandas. Also, I can get all the sheet_names in the Excel file, so I'm able to reach that file as well.
Many Thanks!
You are reading the obj twice, fully:
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
pd.read_excel(io.BytesIO(obj['Body'].read()), ...)
Your object can only be .read() once; the second read produces nothing, just an empty b"".
In order to avoid re-reading the S3 stream many times, you could store it once in a BytesIO, and rewind that BytesIO with seek.
buf = io.BytesIO(obj["Body"].read())  # read the S3 body exactly once
pd.ExcelFile(buf)                     # first pass: discover the sheet names
buf.seek(0)                           # rewind before the next read
pd.read_excel(buf, ...)               # second pass: parse a sheet
# repeat seek(0) + read_excel for each further sheet
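Applied to the loop from the question, a minimal sketch (keeping the question's obj, engine1 and table naming, and assuming an .xlsx workbook) would be:
buf = io.BytesIO(obj['Body'].read())              # read the S3 body exactly once
data = pd.ExcelFile(buf)                          # discover the sheet names
for i, sheet in enumerate(data.sheet_names):
    buf.seek(0)                                   # rewind before every read_excel call
    df = pd.read_excel(buf, sheet_name=sheet, engine='openpyxl')
    df.to_sql("test" + str(i), engine1, index=False)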
I'm trying to extract data from BQ, save it to CSV and then upload it to S3, but I'm getting an error during the upload to S3. This is the error I get when I run the script:
raise ValueError('Filename must be a string')
Please help me solve this issue; I'm new to Python and AWS. Thank you.
Script is:
rows_df = query_job.result().to_dataframe()
file_csv = rows_df.to_csv(s3_filename, sep='|', index=False, encoding='utf-8')
s3.upload_file(file_csv, s3_bucket, file_csv)
Try changing the arguments passed to s3.upload_file like so:
s3.upload_file(s3_filename, s3_bucket, s3_filename)
The to_csv call writes the dataframe to a local file at the path s3_filename, so file_csv is None. Alternatively, if your dataframe is small enough to be held in memory, the following should do the trick:
import io
data = rows_df.to_csv(sep='|', index=False, encoding='utf-8')   # no path given, so a str is returned
data_buffer = io.BytesIO(data.encode('utf-8'))                  # upload_fileobj needs a binary file-like object
s3.upload_fileobj(data_buffer, s3_bucket, s3_filename)
Based on the pandas docs, to_csv returns None when path_or_buf is specified. However, upload_file needs a filename and an S3 key as its first and third arguments respectively. Therefore, something like this should make it work:
s3.upload_file(s3_filename, s3_bucket, s3_filename)
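Putting it together with the question's variables (assuming s3 is a boto3 S3 client and s3_bucket / s3_filename are already defined), the corrected flow is roughly:
rows_df = query_job.result().to_dataframe()
# to_csv writes a local file at s3_filename and returns None
rows_df.to_csv(s3_filename, sep='|', index=False, encoding='utf-8')
# upload_file(Filename, Bucket, Key): local path first, then bucket, then object key
s3.upload_file(s3_filename, s3_bucket, s3_filename)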
A TSV (tab-separated values) file can't be uploaded to Google Colab and read using pandas.
I used this to upload my file:
import io
df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))
import io
stk = pd.read_csv(io.BytesIO(uploaded['train.tsv']))
The TSV file should be uploaded and read into the dataframe stk.
import pandas as pd
from google.colab import files
import io
#firstly, upload file to colab
uploaded = files.upload()
#secondly, get a file-like object for the uploaded file in colab
file_path = io.BytesIO(uploaded['file_name.tsv'])
#the last step is familiar to us
df = pd.read_csv(file_path, sep='\t', header=0)
To save a TSV file on Google Colab, the .to_csv function can be used as follows:
df.to_csv('path_in_drive/filename.tsv', sep='\t', index=False, header=False)
stk = pd.read_csv('path_in_drive/filename.tsv', sep='\t') #to read the file back
I don't know if this is a solution to your problem, as it doesn't upload the files, but this way you can import files that are on your Google Drive.
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/My\ Drive/{'.//'}
After mounting, you should be able to load files into your script just as you would on your desktop.
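For example, a minimal sketch assuming the TSV from the question sits at the top level of My Drive as train.tsv:
import pandas as pd
from google.colab import drive

drive.mount('/gdrive')
# read the tab-separated file straight from the mounted drive
stk = pd.read_csv('/gdrive/My Drive/train.tsv', sep='\t')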
I am doing a data science project.
I am using a Google notebook for my job.
My dataset resides here, and I want to access it directly from the Python notebook.
I am using the following line of code to read it:
df = pd.read_csv('link')
But the command is throwing an error like the one below.
What should I do?
It's difficult to answer exactly as there is a lack of detail, but here you go for this kind of request.
You have to import ZipFile and urlopen in order to get the data from the URL, extract the data from the zip, and then use the CSV file for pandas processing.
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import os
URL = 'https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip'
# open and save the zip file onto computer
url = urlopen(URL)
output = open('05d2b4ea-c-Dataset.zip', 'wb') # note the flag: "wb"
output.write(url.read())
output.close()
# read one of the csv members inside the archive into a pandas dataframe
with ZipFile('05d2b4ea-c-Dataset.zip') as z:
    df = pd.read_csv(z.open('train_data/TransactionData_Train.csv'))
# if keeping the zip file on disk is not wanted, then:
os.remove('05d2b4ea-c-Dataset.zip')  # remove the copy of the zipfile on disk
Use the urllib module to download the zip file into memory; urlopen returns a file-like object that you can read() and pass to ZipFile (standard library).
Since there are multiple files in the archive, like
['test_data/AggregateData_Test.csv', 'test_data/TransactionData_Test.csv', 'train_data/AggregateData_Train.csv', 'train_data/Column_Descriptions.xlsx', 'train_data/sample_submission.csv', 'train_data/TransactionData_Train.csv']
load it into a dict of dataframes with the filename as the key. Altogether the code will be:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO
import pandas as pd
zip_in_memory = urlopen("https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip").read()
z = ZipFile(BytesIO(zip_in_memory))
dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))\
for file in z.infolist()\
if file.filename.endswith('.csv')}
Now you can access dataframes of each csv like dict_of_dfs['test_data/AggregateData_Test.csv'].
Of course, all of this is unnecessary if you just download the zip from the link yourself and pass it as a local zipfile.
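For instance, a minimal sketch assuming the archive was downloaded by hand as 05d2b4ea-c-Dataset.zip:
from zipfile import ZipFile
import pandas as pd

with ZipFile('05d2b4ea-c-Dataset.zip') as z:
    # same dict-of-dataframes idea, just reading from the local copy of the archive
    dict_of_dfs = {name: pd.read_csv(z.open(name))
                   for name in z.namelist()
                   if name.endswith('.csv')}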
I want to download a CSV file stored in Azure Storage into a stream and use it directly in my Python script, but after I did this with help from Thomas, I cannot use the pandas read_csv method; the error message is: pandas.io.common.EmptyDataError: No columns to parse from file. Thus I assume the downloaded CSV stream is actually empty, but after checking the storage account, the CSV file is fine with all the data inside it. What is the problem here? Below is the code from Thomas:
from azure.storage.blob import BlockBlobService
import io
from io import BytesIO, StringIO
import pandas as pd
from shutil import copyfileobj
with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my account', account_key='mykey')
        block_blob_service.get_blob_to_stream('my counter', 'datatest1.csv', input_blob)
        df = pd.read_csv(input_blob)
        print(df)
        copyfileobj(input_blob, output_blob)
        #print(output_blob)
        # Create the new blob
        block_blob_service.create_blob_from_stream('my counter', 'datatest2.csv', output_blob)
If I don't execute the read_csv code, create_blob_from_stream creates an empty file, but if I execute the read_csv code, I get this error:
pandas.parser.TextReader.cinit (pandas\parser.c:6171)
pandas.io.common.EmptyDataError: No columns to parse from file
The downloaded file is stored fine in blob storage with all the data in it.
I finally figured it out, after spending so much time on this!
You have to execute:
input_blob.seek(0)
to use the stream after saving the stream to input_blob.
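So the snippet from the question works once the stream is rewound before each use. A minimal sketch, reusing the same placeholder account, key and container names (the extra seek(0) calls before copying and uploading are assumptions, added for the same reason):
with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my account', account_key='mykey')
        block_blob_service.get_blob_to_stream('my counter', 'datatest1.csv', input_blob)
        input_blob.seek(0)                 # rewind before pandas reads the stream
        df = pd.read_csv(input_blob)
        print(df)
        input_blob.seek(0)                 # rewind again before copying the bytes
        copyfileobj(input_blob, output_blob)
        output_blob.seek(0)                # rewind the copy before uploading it
        block_blob_service.create_blob_from_stream('my counter', 'datatest2.csv', output_blob)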