How to read all parquet files from a s3 bucket

I have an S3 bucket whose folders contain parquet files. I want to read every individual parquet file and concatenate them into a single pandas DataFrame, regardless of which folder each file is in.
I am trying the following code:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://vivienda-test/2022/11', filesystem=s3).read_pandas().to_pandas()
print(pandas_dataframe)
I realize that this only concatenates the parquet files in one specific folder of the bucket, and it also gives me the following error:
Traceback (most recent call last):
File "/Users/Documents/inf.py", line 5, in <module>
pandas_dataframe = pq.ParquetDataset('s3://vivienda-test/2022/11', filesystem=s3).read_pandas().to_pandas()
File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 1790, in __init__
self.validate_schemas()
File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 1824, in validate_schemas
self._schema = self._pieces[0].get_metadata().schema
File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 1130, in get_metadata
f = self.open()
File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 1137, in open
reader = self.open_file_func(self.path)
File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 1521, in _open_dataset_file
return ParquetFile(
File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 286, in __init__
self.reader.open(
File "pyarrow/_parquet.pyx", line 1227, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet file size is 0 bytes
Can someone help me? Thanks.

You can use the AWS SDK for pandas (awswrangler) API to achieve this.
https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
# Reading all Parquet files under a prefix
import awswrangler as wr
df = wr.s3.read_parquet(path='s3://bucket/prefix/')
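The "Parquet file size is 0 bytes" error in the question suggests that at least one object under the prefix is empty (for example a folder placeholder or a marker file), although that is an assumption about this particular bucket. A minimal sketch with s3fs and pandas that walks the whole bucket and skips such objects could look like the following. The helper names `select_parquet_keys` and `read_all_parquet` are hypothetical, and credentials are assumed to come from the environment.

```python
from typing import Dict, List


def select_parquet_keys(listing: Dict[str, dict]) -> List[str]:
    """Keep non-empty objects whose key ends in .parquet, in sorted order."""
    return sorted(
        key
        for key, info in listing.items()
        if key.endswith(".parquet") and info.get("size", 0) > 0
    )


def read_all_parquet(bucket: str):
    """Read every non-empty parquet object in the bucket into one DataFrame."""
    # Imported lazily so the filter above can be used without these packages.
    import pandas as pd
    import s3fs

    fs = s3fs.S3FileSystem()
    # find() walks every "folder" under the bucket; detail=True returns
    # {key: info}, where info includes the object size in bytes.
    listing = fs.find(bucket, detail=True)
    frames = []
    for key in select_parquet_keys(listing):
        with fs.open(key, "rb") as f:
            frames.append(pd.read_parquet(f))
    if not frames:
        return pd.DataFrame()  # nothing to concatenate
    return pd.concat(frames, ignore_index=True)
```

With the bucket from the question, this would be called as `read_all_parquet("vivienda-test")`. Reading file by file is slower than a dataset-level read, but it makes it easy to see which object fails.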

Related

Python assertion error - reading xls from URL

I am trying to read an xls file from a URL into a pandas DataFrame. However, I am getting the assertion error below.
Traceback (most recent call last):
File "c:\Sample_project\venv\excel_read_v1.py", line 17, in <module>
df = pd.read_excel("file.xls",
File "C:\Sample_project\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_base.py", line 457, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_base.py", line 1419, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_xlrd.py", line 25, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_base.py", line 518, in __init__
self.book = self.load_workbook(self.handles.handle)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_xlrd.py", line 38, in load_workbook
return open_workbook(file_contents=data)
File "C:\Sample_project\venv\lib\site-packages\xlrd\__init__.py", line 172, in open_workbook
bk = open_workbook_xls(
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 104, in open_workbook_xls
bk.parse_globals()
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 1211, in parse_globals
self.handle_sst(data)
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 1178, in handle_sst
self._sharedstrings, rt_runlist = unpack_SST_table(strlist, uniquestrings)
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 1472, in unpack_SST_table
assert _unused_i == nstrings - 1
AssertionError
I read some other suggestions on Stack Overflow that removing the last few empty lines from the Excel file would make it work. I tried that by downloading the file to a local folder, removing the last 2 empty rows, and then reading the file from the local folder, and this works. But I need the code to handle this while reading from the URL so that we can automate the process.
I have tried using openpyxl and xlrd to read the file.
Code snapshot below:
import openpyxl
import xlrd
from xlrd import open_workbook
import requests
import pandas as pd
url = url
r = requests.get(url)
with open('maskedfile.xls', 'wb') as output:
    output.write(r.content)
df = pd.read_excel("maskedfile.xls", sheet_name="maskedsheetname")
df.to_csv(r"C:\Sample_project\maskedfile.csv", index=False)

Pandas and glob: convert all xlsx files in folder to csv – TypeError: __init__() got an unexpected keyword argument 'xfid'

I have a folder with many xlsx files that I'd like to convert to csv files.
During my research, I found several threads about this topic. Based on them, I wrote the following code using glob and pandas:
import glob
import pandas as pd
path = r'/Users/.../xlsx files'
excel_files = glob.glob(path + '/*.xlsx')
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel) # error occurs here
    df.to_csv(out)
But unfortunately, I got the following error message that I could not interpret in this context and I could not figure out how to solve this problem:
Traceback (most recent call last):
File "<input>", line 11, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1131, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 475, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 391, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 486, in load_workbook
return load_workbook(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 281, in read
apply_stylesheet(self.archive, self.wb)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py", line 198, in apply_stylesheet
stylesheet = Stylesheet.from_tree(node)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py", line 103, in from_tree
return super(Stylesheet, cls).from_tree(node)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 103, in from_tree
return cls(**attrib)
TypeError: __init__() got an unexpected keyword argument 'xfid'
Does anyone know how to fix this? Thanks a lot for your help!

I had the same problem here. After some hours of thinking and searching, I realized the problem is actually the file. I opened it in MS Excel and saved it. Alakazam, problem solved.
The file was downloaded, so I thought it was a "security" error or just an error in how the file was created. xD
EDIT:
It's not a security problem, but an error in how the file was generated. The correct file is about twice the size (in KB) of the broken one.
One solution: if the file can be opened with xlrd==1.2.0, open it with xlrd first and then pass the resulting Book (the file opened by xlrd) to read_excel.
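import pandas as pd
import xlrd
# df = pd.read_excel('TabelaPrecos.xlsx')
# The commented-out line above is meant to give the same result
# Needs xlrd==1.2.0; xlrd 2.x reads only .xls files
book = xlrd.open_workbook('TabelaPrecos.xlsx')
df = pd.read_excel(book)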
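For the folder conversion in the question, a minimal sketch that keeps going when a single workbook is broken might look like this. The names `csv_path_for` and `convert_folder` are hypothetical. The sketch also builds the output name with pathlib, because `excel.split('.')[0]` cuts the path at the first dot anywhere in it, not at the extension.

```python
from pathlib import Path
from typing import List, Tuple


def csv_path_for(xlsx_path) -> Path:
    """Swap only the final extension for .csv, leaving earlier dots alone."""
    return Path(xlsx_path).with_suffix(".csv")


def convert_folder(folder: str) -> List[Tuple[Path, Exception]]:
    """Convert every .xlsx in folder to .csv and return the files that failed."""
    import pandas as pd  # imported lazily; needs pandas and openpyxl installed

    failures = []
    for xlsx in sorted(Path(folder).glob("*.xlsx")):
        try:
            # index=False leaves the row index out of the CSV
            pd.read_excel(xlsx).to_csv(csv_path_for(xlsx), index=False)
        except Exception as exc:  # e.g. the 'xfid' TypeError from a malformed file
            failures.append((xlsx, exc))
    return failures
```

The returned list shows which files need to be re-saved in Excel (or opened another way) before they can be converted.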

pandas reading excel results in "not a zip file"

I am trying to read an xlsx file into a DataFrame:
itut_ir = pd.read_excel('C:\\Users\\Administrator\\Downloads\\reportdata.xlsx')
print(itut_ir.to_string())
I receive this:
Traceback (most recent call last):
File "C:\Users\Administrator\eclipse-workspace\Reports\GOW\Report.py", line 44, in <module>
df = pd.read_excel('C:\Users\Administrator\Downloads\reportdata.xlsx')
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 824, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_xlrd.py", line 36, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\__init__.py", line 117, in open_workbook
zf = zipfile.ZipFile(filename)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\zipfile.py", line 1222, in __init__
self._RealGetContents()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\zipfile.py", line 1289, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Does anybody have an idea? The file does not seem to be broken; I can open it with Excel.
Thanks!
*** UPDATE ***
The file producing the error is being downloaded from FTP. Opening the original file works, if that gives you a hint :) Thanks.

I had the same issue just a little bit ago with an XLSX that I created in LibreOffice.
The solution was to check the XLSX to make sure it wasn't corrupted. In my case, loading a previous version of the XLSX file corrected the problem.
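One way to check an XLSX file for this kind of corruption before handing it to pandas is to test whether it is a valid zip archive, since an .xlsx file is a zip package that contains a `[Content_Types].xml` entry. The sketch below uses only the standard library. The helper names are hypothetical, and the idea that the FTP transfer damaged the file (for example through a text-mode download) is an assumption rather than something the question confirms.

```python
import zipfile
from ftplib import FTP


def looks_like_xlsx(path: str) -> bool:
    """Return True if path is a zip archive with the entry every .xlsx has."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        return "[Content_Types].xml" in zf.namelist()


def download_binary(ftp: FTP, remote_name: str, local_path: str) -> None:
    """Fetch a file in binary mode so its bytes arrive unchanged."""
    with open(local_path, "wb") as out:
        ftp.retrbinary(f"RETR {remote_name}", out.write)
```

If `looks_like_xlsx` returns False for the downloaded copy but True for the original, the transfer step is the place to look.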

Using Pandas to read excel from url

I am working on a personal project to analyze COVID-19 data. Presently, I am downloading the Excel sheet provided by ourworldindata.org, available at this URL -> https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.xlsx
However, when I try to execute the command in pandas (below), I get a list of errors. What could be the root cause?
url = 'https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.xlsx'
df = pd.read_excel(url, sheet_name='Sheet1')
Error
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\pandas\io\excel\_base.py", line 824, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\pandas\io\excel\_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\pandas\io\excel\_base.py", line 351, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\pandas\io\excel\_xlrd.py", line 34, in load_workbook
return open_workbook(file_contents=data)
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\xlrd\__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\xlrd\book.py", line 92, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\xlrd\book.py", line 1278, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Users\masoom.kumar\PycharmProjects\ReadingINCA_Data\venv\lib\site-packages\xlrd\book.py", line 1272, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\n\n\n\n\n<!D'
Please note that pandas can read the Excel file if I download it to my computer.

Try the link to the raw Excel file:
import pandas as pd
url='https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.xlsx?raw=true'
df=pd.read_excel(url, sheet_name='Sheet1')

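You can do it with requests:
import pandas as pd
import io
import requests
# ?raw=true makes GitHub return the file itself instead of the HTML page
url = 'https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.xlsx?raw=true'
get_content = requests.get(url).content
# The payload is a binary xlsx workbook, so wrap the bytes and use read_excel
df = pd.read_excel(io.BytesIO(get_content), sheet_name='Sheet1')
I do this to avoid saving the file to a local drive or Google Drive, which also saves connection time.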
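The `found b'\n\n\n\n\n<!D'` at the end of the traceback looks like the start of an HTML page (`<!DOCTYPE ...`), which is what the GitHub "blob" URL serves in place of the workbook. As a rough sketch, the two helpers below (hypothetical names, standard library only) rewrite a blob URL to its raw.githubusercontent.com form and detect an HTML payload before it reaches `read_excel`. Branch names that contain slashes are not treated specially.

```python
from urllib.parse import urlsplit


def to_raw_github_url(url: str) -> str:
    """Turn github.com/<owner>/<repo>/blob/<branch>/<path> into its raw URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Leave anything that is not a GitHub blob URL untouched.
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    owner, repo, _blob, *rest = segments
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{'/'.join(rest)}"


def looks_like_html(payload: bytes) -> bool:
    """True if the downloaded bytes start like an HTML document."""
    head = payload.lstrip()[:15].lower()
    return head.startswith(b"<!doctype") or head.startswith(b"<html")
```

Checking the payload first gives a clearer error message than the xlrd "Expected BOF record" failure.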

How to load multiple .mat files into a python script

I want to load 38 .mat files into a dictionary to hold them all.
The .mat files are named subject1 to subject38.
The code I tried is a simple for loop:
import scipy.io as sio
data = {}
for i in range(1, 39):  # 39 is exclusive, so this covers subject1 to subject38
    data["data{}".format(i)] = sio.loadmat('subject{}.mat'.format(i))
The error I'm getting is:
Traceback (most recent call last):
File "D:/senior project/python/dataAqu.py", line 7, in <module>
data["data{0}".format(i)] = sio.loadmat('subject{0}.mat'.format(i))
File "C:\Users\mamdo\AppData\Roaming\Python\Python27\site-packages\scipy\io\matlab\mio.py", line 208, in loadmat
matfile_dict = MR.get_variables(variable_names)
File "C:\Users\mamdo\AppData\Roaming\Python\Python27\site-packages\scipy\io\matlab\mio5.py", line 292, in get_variables
res = self.read_var_array(hdr, process)
File "C:\Users\mamdo\AppData\Roaming\Python\Python27\site-packages\scipy\io\matlab\mio5.py", line 252, in read_var_array
return self._matrix_reader.array_from_header(header, process)
File "mio5_utils.pyx", line 675, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header
File "mio5_utils.pyx", line 705, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header
File "mio5_utils.pyx", line 778, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex
File "mio5_utils.pyx", line 450, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric
File "mio5_utils.pyx", line 355, in scipy.io.matlab.mio5_utils.VarReader5.read_element
File "streams.pyx", line 194, in scipy.io.matlab.streams.ZlibInputStream.read_string
File "pyalloc.pxd", line 9, in scipy.io.matlab.pyalloc.pyalloc_v
MemoryError

So I found the problem. The .mat files must not be open in any other program, such as MATLAB, while the script loads them. If the error persists, restart the computer.
If there is still a memory problem, load the .mat files separately: load one file, run whatever code you need on it, and then load the next file.
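Loading one file at a time can be sketched with a generator, so that only one file's contents is held in memory per iteration. The names `subject_paths` and `iter_mat_files` are hypothetical, and the loader is passed in as an argument (for example `scipy.io.loadmat`), which keeps the helper itself free of any SciPy dependency.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def subject_paths(count: int = 38) -> List[str]:
    """Build subject1.mat .. subject<count>.mat (the end of range is exclusive)."""
    return [f"subject{i}.mat" for i in range(1, count + 1)]


def iter_mat_files(
    paths: Iterable[str], loader: Callable[[str], Any]
) -> Iterator[Tuple[str, Any]]:
    """Yield (path, loaded contents) lazily, one file per iteration."""
    for path in paths:
        yield path, loader(path)


def main() -> None:
    import scipy.io as sio  # imported here so the helpers work without SciPy

    for path, mat in iter_mat_files(subject_paths(), sio.loadmat):
        # Process each file here; the previous file's data can be freed
        # once nothing refers to it any more.
        variables = [name for name in mat if not name.startswith("__")]
        print(path, variables)
```

Calling `main()` from the folder that holds the files processes them in order. Keeping only a small summary per file, instead of storing every loaded dictionary, is what keeps memory use low.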

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
