Python glob.glob(dir) MemoryError

I'm dealing with a memory issue when searching a folder with millions of files. Does anyone know how to overcome this? Is there some way to limit the number of files glob will search, so it could be executed in chunks?
Traceback (most recent call last):
File "./lb2_lmanager", line 533, in <module>
main(sys.argv[1:])
File "./lb2_lmanager", line 318, in main
matched = match_files(policy.directory, policy.file_patterns)
File "./lb2_lmanager", line 32, in wrapper
res = func(*args, **kwargs)
File "./lb2_lmanager", line 380, in match_files
listing = glob.glob(directory)
File "/usr/lib/python2.6/glob.py", line 16, in glob
return list(iglob(pathname))
File "/usr/lib/python2.6/glob.py", line 43, in iglob
yield os.path.join(dirname, name)
File "/usr/lib/python2.6/posixpath.py", line 70, in join
path += '/' + b
MemoryError

Try using a generator instead of building a list. glob.glob() materializes every match in memory at once, while glob.iglob() returns an iterator that yields one path at a time. To understand what generators are, read this.
import glob
dir_list = glob.iglob(YOUR_DIRECTORY)
for file in dir_list:
    print file
Change YOUR_DIRECTORY to the directory that you would like to list.
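If you also want to work through the matches in fixed-size chunks, you can combine the iterator with itertools.islice. A minimal sketch, assuming a chunk size of 1000 and a hypothetical process() helper (neither is in the original code):
import glob
import itertools

def process(path):
    # placeholder for whatever per-file work you need (hypothetical helper)
    print path

matches = glob.iglob(YOUR_DIRECTORY)
while True:
    # pull at most 1000 paths from the iterator; memory use stays bounded
    chunk = list(itertools.islice(matches, 1000))
    if not chunk:
        break
    for path in chunk:
        process(path)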

Related

Python - Parsing all XML files in a folder to CSV files

I just started learning Python, so this might be a very basic question, but here's where I'm stuck.
I'm trying to parse ALL XML files in a given folder and output CSV files with the same filenames as the original XML files. I've tested with single files and it works perfectly, but I'm stuck on doing the same for all of them in a loop, since this is meant to run as a perpetual script.
Here's my code:
import os
import xml.etree.cElementTree as Eltree
import pandas as pd

path = r'C:/python_test'
filenames = []
for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    fullname = os.path.join(path, filename)
    print(fullname)
    filenames.append(fullname)

cols = ["serviceName", "startDate", "endDate"]
rows = []
for filename in filenames:
    xmlparse = Eltree.parse(filename)
    root = xmlparse.getroot()
    csvoutput = []
    for fixed in root.iter('{http://www.w3.org/2001/XMLSchema}channel'):
        channel = fixed.find("channelName").text
        for dyn in root.iter('programInformation'):
            start = dyn.find("publishedStartTime").text
            end = dyn.find("endTime").text
            rows.append({"serviceName": channel, "startDate": start, "endDate": end})
    df = pd.DataFrame(rows, columns=cols)
    df.to_csv(csvoutput)
This is the error I'm getting:
C:/python_test\1.xml
C:/python_test\2.xml
C:/python_test\3.xml
C:/python_test\4.xml
C:/python_test\5.xml
C:/python_test\6.xml
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 49, in <module>
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 3466, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\format.py", line 1105, in to_csv
csv_formatter.save()
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\csvs.py", line 237, in save
with get_handle(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py", line 609, in get_handle
ioargs = _get_filepath_or_buffer(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py", line 396, in _get_filepath_or_buffer
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'list'>
Any kind of suggestions would be greatly appreciated!
Many thanks!
This is your bug:
csvoutput = [] is defined as a list, and later on you pass it as the argument to df.to_csv(csvoutput). So you are passing a list to a method that expects a file path.
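A minimal sketch of the fix, reusing the filenames list built earlier in the question: derive a real output path from each input filename and pass that string to to_csv. Resetting rows per file and using os.path.splitext are assumptions about the intended one-CSV-per-XML behavior; the namespace and tag names are taken from the question as-is:
import os
import xml.etree.cElementTree as Eltree
import pandas as pd

cols = ["serviceName", "startDate", "endDate"]
for filename in filenames:
    root = Eltree.parse(filename).getroot()
    rows = []  # reset per file so each CSV only holds that file's rows
    for fixed in root.iter('{http://www.w3.org/2001/XMLSchema}channel'):
        channel = fixed.find("channelName").text
        for dyn in root.iter('programInformation'):
            rows.append({"serviceName": channel,
                         "startDate": dyn.find("publishedStartTime").text,
                         "endDate": dyn.find("endTime").text})
    # build a path string: same name as the XML, .csv extension
    csvoutput = os.path.splitext(filename)[0] + '.csv'
    pd.DataFrame(rows, columns=cols).to_csv(csvoutput, index=False)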

Pandas and glob: convert all xlsx files in folder to csv – TypeError: __init__() got an unexpected keyword argument 'xfid'

I have a folder with many xlsx files that I'd like to convert to csv files.
During my research, I found several threads about this topic, such as this or that one. Based on those, I put together the following code using glob and pandas:
import glob
import pandas as pd

path = r'/Users/.../xlsx files'
excel_files = glob.glob(path + '/*.xlsx')

for excel in excel_files:
    out = excel.split('.')[0] + '.csv'
    df = pd.read_excel(excel)  # error occurs here
    df.to_csv(out)
But unfortunately I got the following error message, which I could not interpret in this context, and I could not figure out how to solve the problem:
Traceback (most recent call last):
File "<input>", line 11, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1131, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 475, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 391, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 486, in load_workbook
return load_workbook(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 281, in read
apply_stylesheet(self.archive, self.wb)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py", line 198, in apply_stylesheet
stylesheet = Stylesheet.from_tree(node)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py", line 103, in from_tree
return super(Stylesheet, cls).from_tree(node)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 103, in from_tree
return cls(**attrib)
TypeError: __init__() got an unexpected keyword argument 'xfid'
Does anyone know how to fix this? Thanks a lot for your help!
I had the same problem here. After some hours of thinking and searching, I realized the problem is actually the file itself. I opened it using MS Excel and saved it again. Alakazam, problem solved.
The file was downloaded, so I think it's a "security" error or just an error from how the file was created.
EDIT:
It's not a security problem, but an error in how the file was generated: the correct file is about double the size in KB of the broken one.
One workaround: xlrd==1.2.0 can open the file, and you can then pass the Book object (the file opened by xlrd) to read_excel.
import xlrd
import pandas as pd

# df = pd.read_excel('TabelaPrecos.xlsx')
# The line above gives the same result
a = xlrd.open_workbook('TabelaPrecos.xlsx')
b = pd.read_excel(a)
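Applied to the conversion loop from the question, a sketch of that workaround (assuming xlrd==1.2.0 is installed; pandas versions of that era document accepting an xlrd.Book as the io argument):
import glob
import xlrd
import pandas as pd

path = r'/Users/.../xlsx files'
for excel in glob.glob(path + '/*.xlsx'):
    book = xlrd.open_workbook(excel)  # open with xlrd directly
    df = pd.read_excel(book)          # pandas accepts the Book object
    df.to_csv(excel.rsplit('.', 1)[0] + '.csv', index=False)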

loop through and load a zipped folder of yaml files

I have a zipped folder containing 15,000 yaml files. I'd like to iterate through the archive using yaml.safe_load so that each file ends up as a dictionary I can extract the information I need from. I've written some code using zipfile.ZipFile and yaml.safe_load, but it only works for the first file in the archive. Would anyone mind taking a look and explaining what I'm misunderstanding?
import zipfile
import yaml

zip_file = zipfile.ZipFile("D:/export.zip")
files = zip_file.namelist()
print(files)
for i in range(10):
    with zip_file.open(files[i]) as yamlfile:
        yamlreader = yaml.safe_load(yamlfile)
        print(yamlreader["identifier"])
For now I'm just iterating through 10 files to make life easier; eventually I'd like to do the whole 15,000. "identifier" is a key in each yaml file.
This is the error:
10.5281/zenodo.1014773
Traceback (most recent call last):
File "C:/Users/estho/PycharmProjects/GSOC3/testing_dataextraction.py", line 20, in <module>
yamlreader = yaml.safe_load(yamlfile)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\__init__.py", line 162, in safe_load
return load(stream, SafeLoader)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\__init__.py", line 114, in load
return loader.get_single_data()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\constructor.py", line 41, in get_single_data
node = self.get_single_node()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 36, in get_single_node
document = self.compose_document()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\parser.py", line 98, in check_event
self.current_event = self.state()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "C:\Users\estho\PycharmProjects\GSOC3\lib\site-packages\yaml\scanner.py", line 260, in fetch_more_tokens
self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
in "yamlfile_10_5281_zenodo_1745362.yaml", line 4, column 1
Thank you.
The traceback points at the file "yamlfile_10_5281_zenodo_1745362.yaml": line 4 starts with a tab character, and the YAML spec does not allow tabs for indentation, so the scanner fails there. Your loop itself is fine; it's that particular file that is malformed. Try skipping it, or catch the exception so one bad file doesn't abort the whole run.
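A minimal sketch of the skip-and-continue approach (the archive path and the "identifier" key come from the question; the rest is illustrative):
import zipfile
import yaml

zip_file = zipfile.ZipFile("D:/export.zip")
for name in zip_file.namelist():
    with zip_file.open(name) as yamlfile:
        try:
            yamlreader = yaml.safe_load(yamlfile)
        except yaml.YAMLError as exc:
            # report the malformed file and move on instead of crashing
            print("skipping", name, "->", exc)
            continue
        print(yamlreader["identifier"])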

Using os.path.join with os.path.getsize, returning FileNotFoundError

In conjunction with my last question, I'm now printing the filenames with their sizes next to them in a sort of list. Basically I am reading filenames from one file (which are added by the user), joining each filename onto the working directory's path, and printing its size one by one. However, I'm having an issue with the following block:
print("\n--- Stats ---\n")
with open('userdata/addedfiles', 'r') as read_files:
file_lines = read_files.readlines()
# get path for each file and find in trackedfiles
# use path to get size
print(len(file_lines), "files\n")
for file_name in file_lines:
# the actual files should be in the same working directory
cwd = os.getcwd()
fpath = os.path.join(cwd, file_name)
fsize = os.path.getsize(fpath)
print(file_name.strip(), "-- size:", fsize)
which is returning this error:
tolbiac wpm-public → ./main.py --filestatus
--- Stats ---
1 files
Traceback (most recent call last):
File "./main.py", line 332, in <module>
main()
File "./main.py", line 323, in main
parseargs()
File "./main.py", line 317, in parseargs
tracking()
File "./main.py", line 204, in tracking
fsize = os.path.getsize(fpath)
File "/usr/lib/python3.4/genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/tolbiac/code/wpm-public/file.txt\n'
tolbiac wpm-public →
So it looks like something is adding a \n to the end of file_name. I'm not sure whether that comes from os.path.getsize itself; I tried os.stat and it did the same thing.
Any suggestions? Thanks.
When you're reading in a file, you need to be aware of how the data is separated. In this case, the read-in file has one filename per line, separated by that \n newline character, and readlines() keeps the newline at the end of each line. You need to strip it before you use it:
for file_name in file_lines:
    file_name = file_name.strip()
    # rest of for loop
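Alternatively, you could strip the newlines up front when reading the file; a small sketch of that variant (splitlines() drops the line endings for you):
with open('userdata/addedfiles', 'r') as read_files:
    # splitlines() returns the lines without trailing newline characters
    file_lines = read_files.read().splitlines()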

scipy.io TypeError: buffer is too small for requested array

I have a problem in Python. I'm using scipy.io to load a .mat file that was created in MATLAB.
import os
import scipy.io as sio

# loadpathTrain is the folder containing the .mat files (defined elsewhere)
listOfFiles = os.listdir(loadpathTrain)
for f in listOfFiles:
    fullPath = loadpathTrain + '/' + f
    mat_contents = sio.loadmat(fullPath)
    print fullPath
Here's the error:
Traceback (most recent call last):
File "tryRankNet.py", line 1112, in <module>
demo()
File "tryRankNet.py", line 645, in demo
mat_contents = sio.loadmat(fullPath)
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/mio.py", line 111, in loadmat
matfile_dict = MR.get_variables()
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/miobase.py", line 356, in get_variables
getter = self.matrix_getter_factory()
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/mio5.py", line 602, in matrix_getter_factory
return self._array_reader.matrix_getter_factory()
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/mio5.py", line 274, in matrix_getter_factory
tag = self.read_dtype(self.dtypes['tag_full'])
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/miobase.py", line 171, in read_dtype
order='F')
TypeError: buffer is too small for requested array
The whole thing is in a loop. I checked the file that triggers the error by loading it interactively in IDLE; its size is (9, 521), which is not huge at all. I tried to find out whether I'm supposed to clear some buffer after each iteration of the loop, but I could not find anything.
Any help would be appreciated.
Thanks.
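One way to narrow this down is to catch the exception per file, so the loop reports every file that fails to load instead of stopping at the first one. A minimal debugging sketch, reusing loadpathTrain and sio from the question (the bad_files list is an illustrative addition, not a fix):
import os
import scipy.io as sio

bad_files = []
for f in os.listdir(loadpathTrain):
    fullPath = loadpathTrain + '/' + f
    try:
        mat_contents = sio.loadmat(fullPath)
    except TypeError as err:
        # record the failing file and continue instead of crashing
        bad_files.append((fullPath, str(err)))
print bad_files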
