I have a main directory containing multiple subdirectories, each of which holds Excel files. I want to loop through the directories, read the Excel files into pandas DataFrames, and build one collated DataFrame containing the data from all the files.
The code I've written so far finds the Excel files in the subdirectories, but I can't get them into a DataFrame. Can someone help me with this? Here is what I have so far:
import os
import pandas as pd
fin = pd.DataFrame()
rootdir = 'C:\\Divyam Projects\\ISB Work\\NHB Data'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        df = pd.read_csv(file)
        fin.append(df)
print(fin)
In the above code I'm trying to declare a DataFrame 'fin' and then append the information from the different Excel files to it. The code raises this error:
Traceback (most recent call last):
File "c:\Divyam Projects\ISB Work\NHB Data\main.py", line 8, in <module>
df = pd.read_csv(file)
File "C:\Users\divya\anaconda3\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\divya\anaconda3\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\divya\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "C:\Users\divya\anaconda3\lib\site-packages\pandas\io\parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3
The sample header for the excel file is
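For reference, a minimal sketch of one way to make this loop work, under a couple of assumptions: the files really are Excel workbooks (so pd.read_excel is needed rather than pd.read_csv), and the bare file names that os.walk yields must be joined back onto their directory. Note also that DataFrame.append returns a new frame rather than modifying fin in place, so the sketch collects frames in a list and concatenates once at the end. The reader and exts parameters are only there so the same loop can be reused for other formats:

```python
import os
import pandas as pd

def collate_files(rootdir, exts=('.xlsx', '.xls'), reader=pd.read_excel):
    """Walk rootdir and concatenate every matching file into one DataFrame."""
    frames = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if file.lower().endswith(exts):
                # os.walk yields bare names; join the directory back on
                frames.append(reader(os.path.join(subdir, file)))
    # concat once at the end; appending inside the loop is quadratic
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With the path from the question, fin = collate_files('C:\\Divyam Projects\\ISB Work\\NHB Data') would then hold the collated data.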
Related
I'm getting the below error when trying to convert a CSV file to XLSX format using a pandas script.
I tried running the below script:
import os
os.chdir("/opt/alb_test/alb/albt1/Source/alb/al/conversion/scripts")
# Reading the csv file
import pandas as pd
print(pd.__file__)
df_new = pd.read_csv("sourcefile.csv", sep="|", header=None).dropna(axis=1, how="all")
# saving xlsx file
df_new.to_excel("sourcefile.xlsx", index=False)
I am getting the error mentioned below:
Traceback (most recent call last):
File "/opt/alb_test/alb/albt1/Source/alb/al/conversion/scripts/pythn.py", line 13, in <module>
df = pd.read_csv("ff_mdm_reject_report.csv", lineterminator='\n')
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
return parser.read(nrows)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 30488, saw 2
Can anyone guide me on how to fix it?
Thanks in advance!
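For what it's worth, a ParserError like this usually means one row contains more delimiters than the first row, so the parser sees more fields than expected. A small sketch of how to skip (or first just locate) such rows, assuming pandas 1.3+ where read_csv gained the on_bad_lines parameter; the file name and contents below are made up for the demo:

```python
import os
import tempfile
import pandas as pd

# A demo file where one row has an extra "|", mimicking
# "Expected 1 fields in line N, saw 2".
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w") as f:
    f.write("a\n1\n2|extra\n3\n")

# With sep="|" and header=None every row should have one field;
# on_bad_lines="skip" drops malformed rows instead of raising.
df = pd.read_csv(path, sep="|", header=None, on_bad_lines="skip")
```

Using on_bad_lines="warn" instead reports the offending line numbers, which helps inspect the bad rows in the real file before deciding whether skipping them is acceptable.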
I'm trying to retrieve CSV-formatted data with pandas from a .ods file on a shared folder (mounted on my machine using NFS), and I have trouble getting the data when someone else is working on the file.
In that case the file is locked, which makes perfect sense to avoid concurrent editing. You can see this when opening the file with LibreOffice, for example, or just by looking at the folder, as a .~lock file is present.
However, in my case I'm just trying to open the file to read it with pandas, not to edit it. LibreOffice offers this possibility, for instance. How is it that pandas cannot provide that functionality?
To be more precise, here is the command:
sheet_df = pd.read_excel(filepath, sheet_name="Sheet2", engine="odf", skiprows=3)
and the output:
File "/Users/user_name/job.py", line 148, in read_file
sheet_df = pd.read_excel(filepath, sheet_name= "Sheet2", engine="odf", skiprows=3)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 364, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1233, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/pandas/io/excel/_odfreader.py", line 35, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 420, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/pandas/io/excel/_odfreader.py", line 46, in load_workbook
return load(filepath_or_buffer)
File "/Users/user_name/.pyenv/versions/virtualenv_prod/lib/python3.9/site-packages/odf/opendocument.py", line 982, in load
z = zipfile.ZipFile(odffile)
File "/Users/user_name/.pyenv/versions/3.9.2/lib/python3.9/zipfile.py", line 1257, in __init__
self._RealGetContents()
File "/Users/user_name/.pyenv/versions/3.9.2/lib/python3.9/zipfile.py", line 1322, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
I'm using Python 3.9.2 on a Mac running Big Sur, by the way.
Am I missing something, or can pandas.read_excel not simply read a file?
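One defensive workaround (a sketch, not a pandas feature): every .ods document is a zip container, while the .~lock.<name># marker LibreOffice leaves next to it, or a file caught mid-write over NFS, is not. A cheap standard-library check before calling read_excel turns the cryptic BadZipFile into an explicit skip-or-retry decision:

```python
import zipfile

def is_readable_ods(path):
    """Cheap pre-check: .ods files are zip archives, lock files are not."""
    return zipfile.is_zipfile(path)
```

Copying the document to a temporary location first (for example with shutil.copy) and reading the copy is another common way to sidestep readers that are sensitive to files being edited concurrently.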
I am attempting to iterate over many CSV files, convert each to a pandas DataFrame to manipulate it, then send the manipulated DataFrame back to a CSV. I am able to do this with a single CSV by specifying its path. However, I am unable to get past listing the directory that houses the CSVs when using os.listdir(). Below is my code and the associated error:
import pandas as pd
import os
dir_name = 'C:/PA_Boundaries/Tests/'
dfs = []
for file in os.listdir(dir_name):
    df = pd.read_csv(file)
    dfs.append(df)
print(dfs)
Error message:
Traceback (most recent call last):
File "<ipython-input-3-c071588e1670>", line 1, in <module>
runfile('C:/PA_Boundaries/osDIRtest.py', wdir='C:/PA_Boundaries')
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/PA_Boundaries/osDIRtest.py", line 7, in <module>
df = pd.read_csv(file)
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\parsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\parsers.py", line 462, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\parsers.py", line 819, in __init__
self._engine = self._make_engine(self.engine)
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\parsers.py", line 1050, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\parsers.py", line 1867, in __init__
self._open_handles(src, kwds)
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\parsers.py", line 1368, in _open_handles
storage_options=kwds.get("storage_options", None),
File "C:\Users\mmulford\AppData\Local\ESRI\conda\envs\arcgispro-py3-clone\lib\site-packages\pandas\io\common.py", line 647, in get_handle
newline="",
FileNotFoundError: [Errno 2] No such file or directory: 'Berks2_output.csv'
The file definitely exists in the directory, so I am unsure why it's not able to find it. Any help would be greatly appreciated.
If you use pathlib, your code will run correctly on both Windows and macOS:
import os
import pandas as pd
from pathlib import Path

dir_name = Path("C:\\PA_Boundaries\\test")
dfs = []
for file in os.listdir(dir_name):
    df = pd.read_csv(dir_name / file)
    dfs.append(df)
print(dfs)
I have a CSV with strings containing the line terminator, which I can import with pandas using this code:
df_desc = pd.read_csv(import_desc, sep="|")
But when I try to import it into a dask DataFrame:
import dask.dataframe as ddf
import_info = "data/info.csv"
df_desc = ddf.read_csv(import_desc, sep="|", blocksize=None, dtype='str')
I get this error:
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/data_extraction_dask.py", line 10, in <module>
df_desc = ddf.read_table(import_desc, sep="|", blocksize=None, dtype='str')
File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 323, in read
**kwargs)
File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 243, in read_pandas
head = reader(BytesIO(b_sample), **kwargs)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 411, in _read
data = parser.read(nrows)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 982, in read
ret = self._engine.read(nrows)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1719, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 130
The documentation mentions:
It should also be noted that this function may fail if a CSV file
includes quoted strings that contain the line terminator. To get
around this you can specify blocksize=None to not split files into
multiple partitions, at the cost of reduced parallelism.
That's why I used blocksize=None, but this function uses a sampling strategy that reads the first bytes of the file to determine the column types and, I think, that is what generates this error.
I can't skip the sampling step, even by indicating the types with dtype.
Is there any workaround?
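One workaround (a sketch; the data here is a placeholder) is to let pandas do the whole parse, since its quoting logic handles line terminators inside quoted fields in a single pass, and then hand the result to dask with dask.dataframe.from_pandas. The snippet below demonstrates only the pandas half:

```python
import io
import pandas as pd

# A pipe-delimited sample whose quoted field spans two lines,
# like the rows that break dask's byte-based sampling.
data = 'id|desc\n1|"first line\nsecond line"\n2|plain\n'
df = pd.read_csv(io.StringIO(data), sep="|")
```

dask.dataframe.from_pandas(df, npartitions=4) would then give back a partitioned frame; the cost is that the whole file is parsed in memory once, which matches what blocksize=None already implies.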
I am trying to walk a directory tree, and for each CSV encountered on the walk I would like to open the file and read columns 0 and 15 into a DataFrame (after which I'll process it and move on to the next file). I can walk the directory tree using the following:
rootdir = r'C:/Users/stacey/Documents/Alco/auditopt/'
for dirName, subdirList, fileList in os.walk(rootdir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
        df = pd.read_csv(fname, header=1, usecols=[0, 15], parse_dates=[0], dayfirst=True, index_col=[0], names=['date', 'total_pnl_per_pos'])
        print(df)
but I'm getting the error message:
FileNotFoundError: File b'auditopt.os-pnl.BBG_XASX_ARB_S-BBG_XTKS_7240_S.csv' does not exist.
I am trying to read files which definitely exist. They are CSV files saved from MS Excel, so I don't know if that is an issue; if it is, could someone let me know how to read an MS Excel .csv into a DataFrame, please?
The full stack trace is as follows:
Found directory: C:/Users/stacey/Documents/Alco/auditopt/
Found directory: C:/Users/stacey/Documents/Alco/auditopt/roll_597_oe_2017-03-10
tradeopt.os-pnl.BBG_XASX_ARB_S-BBG_XTKS_7240_S.csv
Traceback (most recent call last):
File "<ipython-input-24-3753e367432d>", line 1, in <module>
runfile('C:/Users/stacey/Documents/scripts/Pair_Results_Code_1.0.py', wdir='C:/Users/stacey/Documents/scripts')
File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/stacey/Documents/scripts/Pair_Results_Code_1.0.py", line 49, in <module>
main()
File "C:/Users/stacey/Documents/scripts/Pair_Results_Code_1.0.py", line 36, in main
df = pd.read_csv(fname, header=1, usecols=[0,15],parse_dates=[0], dayfirst=True,index_col=[0], names=['date', 'total_pnl_per_pos'])
File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1390, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas\parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4184)
File "pandas\parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8449)
FileNotFoundError: File b'tradeopt.os-pnl.BBG_XASX_ARB_S-BBG_XTKS_7240_S.csv' does not exist
When reading in the file, you need to provide the full path. os.walk does not include the directory in the file names it yields, so you'll need to supply it yourself.
Use os.path.join to make this easy.
import os
full_path = os.path.join(dirName, file)
df = pd.read_csv(full_path, ...)
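The same join can be expressed with pathlib, where rglob both walks the tree and yields full paths in one step (a sketch; the .csv extension filter and concatenation are assumptions about what the surrounding loop does):

```python
import pandas as pd
from pathlib import Path

def collate_csvs(rootdir):
    """Read every .csv under rootdir (full paths included) into one frame."""
    # rglob recurses and yields Path objects with the directory attached,
    # so no explicit os.path.join is needed
    frames = [pd.read_csv(p) for p in sorted(Path(rootdir).rglob("*.csv"))]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```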