Pandas read_json too many open files error - python

I keep getting a 'Too many open files' error when doing something like this:
# read file names
file_names = []
for file_name in os.listdir(path):
    if '.json' not in file_name:
        continue
    file_names.append(file_name)
# process file names...
# iter files
for file_name in file_names:
    # load file into DF
    file_path = path + '/' + file_name
    df = pandas.read_json(file_path)

    # process the data, etc...
    # not real var names, just for illustration purposes...
    json_arr_1 = ...
    json_arr_2 = ...

    # save DF1 to new file
    df_1 = pandas.DataFrame(data=json_arr_1)
    file_name2 = os.getcwd() + '/db/' + folder_name + '/' + file_name
    df_1.to_json(file_name2, orient='records')

    # save DF2 to new file
    df_2 = pandas.DataFrame(data=json_arr_2)
    file_name3 = os.getcwd() + '/db/other/' + folder_name + '/' + file_name
    df_2.to_json(file_name3, orient='records')
The DataFrame documentation doesn't mention having to open or close files, and I don't think listdir keeps pointers to open files (it should just return a list of strings).
Where am I going wrong?

This seems like a system issue, not a pandas issue.
You might need to increase the limit on open files in your system.
This tutorial explains how to increase the limit:
https://easyengine.io/tutorials/linux/increase-open-files-limit/
The Q&A "IOError: [Errno 24] Too many open files:" discusses ulimit and the limit on open files.
This Q&A discusses why the number of open files is limited in Linux:
https://unix.stackexchange.com/questions/36841/why-is-number-of-open-files-limited-in-linux
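If you want to inspect or raise the limit from inside the script rather than via ulimit, here is a minimal sketch using the standard-library resource module (Unix only; the target value of 4096 is just an illustration):
import resource

# query the current per-process open-file limits
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# raise the soft limit, capped at the hard limit (4096 is illustrative)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))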

Related

I'm using os and pandas to modify multiple excel spreadsheets by iterating through a folder but ~$ keeps getting added to the file name

My initial code is here:
import pandas as pd
import os

directory_in_str = input('\n\nEnter the name of the folder you would like to use. If there are spaces, replace with underscores: ')
directory_in_str = directory_in_str.strip()
directory = os.fsencode(directory_in_str)
user = input('\nEnter your first initial and last name as one word (ex: username): ')
user = user.strip()
path1 = '/Users/'
path2 = '/Desktop/DataScience/'
dspath = path1 + user + path2
slash = '/'
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".xls") or filename.endswith(".xlsx"):
        print(directory)
        pathname = dspath + directory_in_str + slash + filename
        print(filename)
        #Global = pd.read_excel(pathname, sheet_name=0)
        Stats = pd.read_excel(pathname, sheet_name=1)
        listorder = ['1', '2', '3']
        Stats = Stats.reindex(columns=listorder)
        Stats.to_excel(filename, sheet_name='Statistics', index=False)
        continue
    else:
        continue
I've included the filename print statement to ensure that the correct path is being used. However, the print statement happens twice.
These are the statements printed.
b'testrearrange'
Testname.xlsx
b'testrearrange'
~$Testname.xlsx
Why are the two characters '~$' added? The error originates from the line
Stats = pd.read_excel(pathname, sheet_name=1)
with the error
ValueError: File is not a recognized excel file
Does anyone know how to fix this?
I think the files starting with "~$" are temporary Excel files that are created when you open the file in Excel. One option is to close the file, in which case the temporary file is deleted. The other option is to change the logic by which you list the files to be read so that it ignores files that start with ~. I like to use glob for this:
from glob import glob
path = "C:/Users/Wolf/[!~]*.xls*"
files = glob(path)
for file in files:
    print("Do your thing here")

Importing for loop to skip data file not exist

I have about 500 Excel files named in the format: data_1, data_2 ... data_500.
However, not all files are there. A file like data_3 is not in the folder.
I want to import all available data into a dataframe.
However, my code below stops when it hits the name of a file not in the folder, say data_3.
Can you please help me skip these records?
Thank you,
HN
for i in range(500):
    filename = 'data_' + str(i) + '.xlsx'
    output = pd.read_excel('PATH' + filename)
The key is to check the full path against glob.glob:
import glob
for i in xlx_file_list:
    filename = 'Excel_Sample' + str(i) + '.xlsx'  #; print(filename)
    full_path = 'D:\Python...\\' + filename  #; print(full_path)
    if full_path not in glob.glob('D:\Python...\*'):
        print(filename, ' not in folder')
        continue
    outfile = pd.read_excel(full_path, sheet_name='data_sheet')
    print(outfile)
Hi, in your sample PATH is probably a variable, not a string, so 'PATH' + filename cannot work.
I suggest using os.path.join() to compose file paths; don't use string concatenation for this.
There are two ways to solve this problem:
Generate all the names and check whether each file exists:
import os
for i in range(500):
    filename = 'data_' + str(i) + '.xlsx'
    if os.path.exists(filename):
        output = pd.read_excel(filename)
or generate only the correct filename list:
import glob
for filename in glob.glob('data_*.xlsx'):
    output = pd.read_excel(filename)
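A third option (my own sketch, not from the answers above) is to attempt the read and catch the failure, which avoids the gap between checking for a file and opening it:
import pandas as pd

for i in range(500):
    filename = 'data_' + str(i) + '.xlsx'
    try:
        output = pd.read_excel(filename)
    except FileNotFoundError:
        continue  # e.g. data_3.xlsx is missing; skip it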

How to read files from two folders and avoid duplicates in Python

I have the following folders that I read SQL files from and save them as variables:
++folder1
-1.sql
-2.sql
-3.sql
++folder2
-2.sql
The following code does the job well for a single folder. How can I modify it to read not just from one folder but from two, with the rule that if a file exists in folder2 then the file with the same name in folder1 is not read?
folder1 = '../folder1/'
for filename in os.listdir(folder1):
    path = os.path.join(folder1, filename)
    if os.path.isdir(path):
        continue
    with open(folder1 + filename, 'r') as myfile:
        data = myfile.read()
    query_name = filename.replace(".sql", "")
    exec(query_name + " = data")
You can try something like the following:
folders = ['../folder2/', '../folder1/']
checked = []
for folder in folders:
    for filename in os.listdir(folder):
        if filename not in checked:
            checked.append(filename)
            path = os.path.join(folder, filename)
            if os.path.isdir(path):
                continue
            with open(folder + filename, 'r') as myfile:
                data = myfile.read()
            query_name = filename.replace(".sql", "")
            exec(query_name + " = data")
The answer to this is simple: Do two listdir calls, then skip over the files in folder1 that are also in folder2.
One way to do this is with set operations: the set difference a - b means all elements in a that are not also in b, which is exactly what you want.
files1 = set(os.listdir(folder1))
files2 = set(os.listdir(folder2))
files1 -= files2
paths1 = [os.path.join(folder1, file) for file in files1]
paths2 = [os.path.join(folder2, file) for file in files2]
for path in paths1 + paths2:
    if os.path.isdir(path):
        # etc.
As a side note, dynamically creating a bunch of variables like this is almost always a very bad idea, and doing it with exec instead of globals or setattr is an even worse idea. It's usually much better to store everything in, e.g., a dict. For example:
queries = {}
for path in paths1 + paths2:
    if os.path.isdir(path):
        continue
    name = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        queries[name] = f.read()
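A usage sketch (my assumption about the intended lookup, given the folder layout in the question):
sql = queries["2"]  # contents of folder2/2.sql, since folder2's copy wins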

renaming the extracted file from zipfile

I have lots of zipped files on a Linux server and each file includes multiple text files.
What I want is to extract some of those text files, which have the same name across the zipped files, and save them in a folder; I am creating one folder for each zipped file and extracting the text files into it. I need to add the parent zip file's name to the end of the file names and save all the text files in one directory. For example, if the zipped folder was March132017.zip and I extracted holding.txt, my filename would be holding_March132017.txt.
My problem is that I am not able to change the extracted file's name.
I would appreciate it if you could advise.
import os
import sys
import zipfile

os.chdir("/feeds/lipper/emaxx")
pwkwd = "/feeds/lipper/emaxx"

for item in os.listdir(pwkwd):  # loop through items in dir
    if item.endswith(".zip"):  # check for ".zip" extension
        file_name = os.path.abspath(item)  # get full path of files
        fh = open(file_name, "rb")
        zip_ref = zipfile.ZipFile(fh)
        filelist = 'ISSUERS.TXT', 'SECMAST.TXT', 'FUND.TXT', 'HOLDING.TXT'
        for name in filelist:
            try:
                outpath = "/SCRATCH/emaxx" + "/" + os.path.splitext(item)[0]
                zip_ref.extract(name, outpath)
            except KeyError:
                pass  # this member is not in the archive
        fh.close()
import zipfile

zipdata = zipfile.ZipFile('somefile.zip')
zipinfos = zipdata.infolist()

# iterate through each file
for zipinfo in zipinfos:
    # This will do the renaming
    zipinfo.filename = do_something_to(zipinfo.filename)
    zipdata.extract(zipinfo)
Reference:
https://bitdrop.st0w.com/2010/07/23/python-extracting-a-file-from-a-zip-file-with-a-different-name/
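Applied to the question, a minimal sketch (the zip name and output directory are taken from the question; do_something_to is replaced by an inline rename) might look like:
import os
import zipfile

with zipfile.ZipFile('March132017.zip') as zipdata:
    for zipinfo in zipdata.infolist():
        stem, ext = os.path.splitext(zipinfo.filename)
        # holding.txt -> holding_March132017.txt
        zipinfo.filename = stem + '_March132017' + ext
        zipdata.extract(zipinfo, '/SCRATCH/emaxx')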
Why not just read the file in question and save it yourself instead of extracting? Something like:
import os
import zipfile

source_dir = "/feeds/lipper/emaxx"  # folder with zip files
target_dir = "/SCRATCH/emaxx"  # folder to save the extracted files

# Are you sure your file names are capitalized in your zip files?
filelist = ['ISSUERS.TXT', 'SECMAST.TXT', 'FUND.TXT', 'HOLDING.TXT']

for item in os.listdir(source_dir):  # loop through items in dir
    if item.endswith(".zip"):  # check for ".zip" extension
        file_path = os.path.join(source_dir, item)  # get zip file path
        with zipfile.ZipFile(file_path) as zf:  # open the zip file
            for target_file in filelist:  # loop through the list of files to extract
                if target_file in zf.namelist():  # check if the file exists in the archive
                    # generate the desired output name:
                    target_name = os.path.splitext(target_file)[0] + "_" + os.path.splitext(item)[0] + ".txt"
                    target_path = os.path.join(target_dir, target_name)  # output path
                    with open(target_path, "wb") as f:  # zf.read() returns bytes, so write in binary mode
                        f.write(zf.read(target_file))  # save the contents of the file in it
                    # next file from the list...
        # next zip file...
You could simply run a rename after each file is extracted, right? os.rename should do the trick:
zip_ref.extract(name, outpath)
extracted_path = os.path.join(outpath, name)  # the file extract() just wrote
parent_zip = os.path.basename(outpath)  # e.g. "March132017"
stem, ext = os.path.splitext(os.path.basename(name))  # just the filename
new_name_path = os.path.dirname(outpath) + os.sep + stem + "_" + parent_zip + ext
os.rename(extracted_path, new_name_path)
For the filename, if you want it to be incremental, simply start a count and, for each file, go up by one.
count = 0
for file in files:
    count += 1
    # ... do our file actions
    new_file_name = original_file_name + "_" + str(count)
    # ...
Or if you don't care about the end name you could always use something like a uuid.
import uuid
random_name = uuid.uuid4()
outpath = '/SCRATCH/emaxx'
suffix = os.path.splitext(item)[0]

for name in filelist:
    names = zip_ref.namelist()
    if name in names:  # check the file exists in the zipfile
        index = names.index(name)  # lists have index(), not find()
        filename, ext = os.path.splitext(name)  # ext already includes the dot
        zip_ref.filelist[index].filename = f'{filename}_{suffix}{ext}'  # rename the entry to the suffixed file name
        zip_ref.extract(zip_ref.filelist[index], outpath)  # extract using the renamed entry
I doubt it is possible to rename files during extraction.
What about renaming them once they are extracted?
Relying on the Linux shell, you can achieve it in one line:
os.system("find " + outpath + " -name '*.txt' -exec echo mv {} `echo {} | sed s/.txt/" + zipName + ".txt/` \;")
So, first we find all txt files in the specified folder, then exec the renaming command, with the new name computed by sed.
Code not tested, I'm on Windows now ^^'
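A pure-Python sketch of the same post-extraction rename (my own assumption, reusing the outpath and zipName variables from the snippet above):
import glob
import os

for txt in glob.glob(os.path.join(outpath, '*.txt')):
    root, ext = os.path.splitext(txt)
    os.rename(txt, root + zipName + ext)  # e.g. holding.txt -> holdingMarch132017.txt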

Exporting multiple files with different filenames

Let's say I have n files in a directory with filenames file_1.txt, file_2.txt, file_3.txt ... file_n.txt. I would like to import them into Python individually, do some computation on them, and then store the results in n corresponding output files: file_1_o.txt, file_2_o.txt ... file_n_o.txt.
I've figured out how to import multiple files:
import glob
import numpy as np

path = r'home\...\CurrentDirectory'
allFiles = glob.glob(path + '/*.txt')
for file in allFiles:
    # do something to file
    ...
    ...
    np.savetxt(file, ) ???
I'm not quite sure how to append the _o (or any string, for that matter) before the extension so that the output file is file_1_o.txt.
Can you use the following snippet to build the output filename?
parts = in_filename.split(".")
out_filename = parts[0] + "_o." + parts[1]
where I assumed in_filename is of the form "file_1.txt".
Of course it would probably be better to put "_o." (the suffix before the extension) in a variable, so that you can change it at will in just one place.
In your case it means
import glob
import numpy as np

path = r'home\...\CurrentDirectory'
allFiles = glob.glob(path + '/*.txt')
for file in allFiles:
    # do something to file
    ...
    parts = file.split(".")
    out_filename = parts[0] + "_o." + parts[1]
    np.savetxt(out_filename, ) ???
but you need to be careful: before you pass out_filename to np.savetxt you may need to build the full path, so you might need something like
np.savetxt(os.path.join(path, out_filename), )
or something along those lines.
If you would like to combine the change into basically one line and define your "suffix in a variable" as mentioned before, you could have something like
hh = "_o."  # variable suffix
..........
# inside your loop now
for file in allFiles:
    out_filename = hh.join(file.split("."))
which uses another way of doing the same thing, by calling join on the split list, as mentioned by @NathanAck in his answer.
import os

# put the path to the files here
filePath = "C:/stack/codes/"
theFiles = os.listdir(filePath)

for file in theFiles:
    # add path name before the file
    file = filePath + str(file)
    fileToRead = open(file, 'r')
    fileData = fileToRead.read()

    # DO WORK ON SPECIFIC FILE HERE
    # access the file through the fileData variable
    fileData = fileData + "\nAdd text or do some other operations"

    # change the file name to add _o
    fileVar = file.split(".")
    newFileName = "_o.".join(fileVar)

    # write the modified data in fileData to the file with _o added
    fileToWrite = open(newFileName, 'w')
    fileToWrite.write(fileData)

    # close open files
    fileToWrite.close()
    fileToRead.close()
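A note on the split(".") approach: it breaks if the path contains any other dot (e.g. a dotted directory name). A more robust sketch (my own, not from the answers above) uses os.path.splitext:
import os

def output_name(in_path, suffix="_o"):
    # insert the suffix just before the extension
    root, ext = os.path.splitext(in_path)
    return root + suffix + ext

print(output_name("file_1.txt"))  # file_1_o.txt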
