Multiple File Stream Issue - python

In my project, I am using 3 files throughout the whole process. The source file (.ada), a "three address code" file (.TAC), and my own temporary file for use during processing (.TACTMP).
in Caller.py:
TACFILE = open(str(sys.argv[1])[:-4] + ".TAC", 'w') # line 17
# Does a bunch of stuff
TACFILE.close() # line 653
# the below function is imported from Called.py
post_process_file_handler() # line 654
in Called.py:
TAC_FILE_NAME = str(sys.argv[1])[:-4] # line 6
TAC_lines = open(TAC_FILE_NAME + ".TAC", 'r').readlines() # line 7
If I try to run my program without already having a .TAC file (even a blank one), I get the following error:
Traceback (most recent call last):
File "Caller.py", line 8, in <module>
from Called import post_process_file_handler
File "Called.py", line 7, in <module>
TAC_lines = file(TAC_FILE_NAME + ".TAC", 'r').readlines()
IOError: [Errno 2] No such file or directory: 'test76.TAC'
Why would this be happening? This error is being thrown even if I put a breakpoint at the beginning of Caller.py, well before the post_process_file_handler() function ever gets called.
For clarity: test76.TAC should be generated by Caller.py, and then Called.py should open that file to process it further; for some reason that isn't happening.

This may be specific to my case, but I found out the issue is due to the order and manner in which I was using these streams.
In short, when the import line was encountered:
from Called import post_process_file_handler
it executed all of Called.py's module-level code, and since the file is opened by a global statement in Called.py, that open ran before Caller.py had a chance to create the .TAC file it reads from.
Moving the import line to just before I use the function fixed my issue, as nothing in Called.py runs until after Caller.py has finished its work; a sketch of the rearranged code is below.
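For illustration, the rearranged Caller.py might look roughly like this (a sketch reusing the names from the question; the work between opening and closing the file is elided):
# Caller.py (sketch)
import sys

TACFILE = open(str(sys.argv[1])[:-4] + ".TAC", 'w')
# ... write out the three-address code ...
TACFILE.close()

# Deferred import: the module-level code in Called.py now runs
# only after the .TAC file exists on disk.
from Called import post_process_file_handler
post_process_file_handler()
An equally workable alternative is to move the module-level open(...).readlines() in Called.py inside post_process_file_handler, so the file isn't touched until the function is actually called.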


`concurrent.futures.ProcessPoolExecutor` in Python is run from the beginning of the file instead of the defined function

I'm having trouble with concurrent.futures. For some short background, I was trying to do massive image manipulation with python-opencv2. I ran into a performance issue, which is a pain considering it can take hours to process only hundreds of images. I tried to fix it with concurrent.futures, spreading the work across CPU cores to make the process go faster (I noticed that while it took a really long time to process, it only used about 16% of my 6-core processor, which is roughly a single core). So I wrote the code, but then I noticed that the multiprocessing actually starts from the beginning of the file instead of being isolated to the function I just created. Here's a minimal reproduction of the error:
import glob
import concurrent.futures
import cv2
import os

def convert_this(filename):
    ### Read in the image data
    img = cv2.imread(filename)
    ### Resize the image
    res = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    res.save("output/"+filename)

try:
    # create output dir
    os.mkdir("output")

    with concurrent.futures.ProcessPoolExecutor() as executor:
        files = glob.glob("../project/temp/")
        executor.map(convert_this, files)
except Exception as e:
    print("Encountered Error!")
    print(e)

    filelist = glob.glob("output")
    for f in filelist:
        os.remove(f)
    os.rmdir("output")
It gave me an error:
Encountered Error!
Encountered Error!
[WinError 183] Cannot create a file when that file already exists: 'output'
Traceback (most recent call last):
File "M:\pythonproject\testfolder\test.py", line 17, in <module>
os.mkdir("output")
[WinError 183] Cannot create a file when that file already exists: 'output'
Encountered Error!
[WinError 183] Cannot create a file when that file already exists: 'output'
Traceback (most recent call last):
File "M:\pythonproject\testfolder\test.py", line 17, in <module>
os.mkdir("output")
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'output'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\<username>\Anaconda3\envs\py37\lib\multiprocessing\spawn.py", line 105, in spawn_main
Encountered Error!
[WinError 183] Cannot create a file when that file already exists: 'output'
Traceback (most recent call last):
File "M:\pythonproject\testfolder\test.py", line 17, in <module>
os.mkdir("output")
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'output'
...
(it was repeating errors of the same "can't create file")
As you can see, os.mkdir was run even though it's outside of the convert_this function I just defined. I'm not that new to Python, but I'm definitely new to multiprocessing and threading. Is this just how concurrent.futures behaves? Or did I miss something in the documentation?
Thanks.
Yes, multiprocessing must import your file in each new process before it can run the function (just as Python does when you run the file yourself), so it executes all of the top-level code you have written. So either (1) move your multiprocessing code to a separate file with nothing extra in it and call that, or (2) enclose your top-level code in a function (e.g. main()) and at the bottom of your file write:
if __name__ == "__main__":
    main()
This block is only run when you start the script yourself, not by the spawned worker processes. See the Python docs for details on this construction.
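Applied to the code from the question, option (2) might look roughly like this. It's a sketch: convert_this is copied from the question, except that res.save is replaced by cv2.imwrite (OpenCV images are NumPy arrays and have no save method) and a * is added to the glob pattern so it matches files rather than the directory itself.
import glob
import concurrent.futures
import cv2
import os

def convert_this(filename):
    # Read the image and convert it to grayscale
    img = cv2.imread(filename)
    res = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cv2.imwrite(os.path.join("output", os.path.basename(filename)), res)

def main():
    os.mkdir("output")
    with concurrent.futures.ProcessPoolExecutor() as executor:
        files = glob.glob("../project/temp/*")
        executor.map(convert_this, files)

if __name__ == "__main__":
    # Only the parent process runs this; the spawned workers import
    # the module but skip everything under the guard.
    main()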

Why doesn't fileinput throw an error when there's a bad path?

import fileinput

def main():
    try:
        lines = fileinput.input()
        res = process_lines(lines)
        # ...more code
    except Exception:
        print('is your file path bad?')

if __name__ == '__main__':
    main()
When I run this code with a bad path, it doesn't throw an error, yet the docs say that an OS error will be thrown if there's an IO error. How do I test for bad paths then?
fileinput.input() returns an iterator, not an ad-hoc list:
In [1]: fileinput.input()
Out[1]: <fileinput.FileInput at 0x7fa9bea55a50>
Proper usage of this function is done via a for loop:
with fileinput.input() as files:
    for line in files:
        process_line(line)
or using a conversion to list:
lines = list(fileinput.input())
I.e. the files are opened only when you actually iterate over this object.
I wouldn't recommend the second way, though, as it runs counter to the philosophy of how such scripts are supposed to work.
You are supposed to parse as little as you need to output data, and then output it as soon as possible. This avoids issues with large inputs, and if your script is used within a larger pipeline, speeds up the processing significantly.
With regard to checking whether the path is correct or not: as soon as you iterate down to a file that doesn't exist, the iterator will throw an exception:
# script.py
import fileinput

with fileinput.input() as files:
    for line in files:
        print(repr(line))
$ echo abc > /tmp/this_exists
$ echo xyz > /tmp/this_also_exists
$ python script.py /tmp/this_exists /this/does/not /tmp/this_also_exists
'abc\n'
Traceback (most recent call last):
File "/tmp/script.py", line 6, in <module>
for line in files:
File "/home/mrmino/.pyenv/versions/3.7.7/lib/python3.7/fileinput.py", line 252, in __next__
line = self._readline()
File "/home/mrmino/.pyenv/versions/3.7.7/lib/python3.7/fileinput.py", line 364, in _readline
self._file = open(self._filename, self._mode)
FileNotFoundError: [Errno 2] No such file or directory: '/this/does/not'
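If you want to fail fast on a bad path rather than partway through the input, one approach (my own sketch, not part of the answer above) is to validate the arguments up front, or to catch the error around the loop:
import fileinput
import os
import sys

# Option 1: check the paths before fileinput ever opens them
# (assumes plain file arguments, not the '-' stdin placeholder).
missing = [p for p in sys.argv[1:] if not os.path.isfile(p)]
if missing:
    sys.exit("missing input file(s): " + ", ".join(missing))

# Option 2: let fileinput fail lazily and catch the error where you iterate.
try:
    with fileinput.input() as files:
        for line in files:
            print(repr(line))
except FileNotFoundError as e:
    sys.exit("is your file path bad? {}".format(e))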

Python OSError: [Errno 9] Bad file descriptor after opening big json file

I just tried to read in a big JSON file (the Wikipedia JSON dump) in Python line by line and got this error:
Traceback (most recent call last):
File "C:/.../test_json_wiki_file.py", line 19, in <module>
test_fct()
File "C:/.../test_json_wiki_file.py", line 12, in test_fct
for line in f:
OSError: [Errno 9] Bad file descriptor
Here is my code:
import json

def test_fct():
    data = []
    i = 0
    with open('E:/.../20200713.json/20200713.json') as f:
        for line in f:
            data.append(json.loads(line))
            i = i + 1
            if i > 1:
                input_file.close()
                return data

test_data = test_fct()
The file size is around 700GB and the description (https://www.wikidata.org/wiki/Wikidata:Database_download) of the file states that it can be read line by line. I don't know if this is important but the E:/ hard drive is an external one.
Thank you for your help in advance :)
I don't have any firsthand knowledge of opening large files in Python, but did you mean to have the path as 20200713.json/20200713.json? Is the first one actually a directory that has a .json extension? I'd also suggest trying to load a smaller sample of the file first (opening the whole thing might be hard, so maybe just use the more command in a terminal?).
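If you'd rather take that sample from within Python, a minimal sketch (the path is copied from the question, elided '...' and all):
from itertools import islice

# Peek at the first few lines instead of reading the whole ~700 GB dump.
with open('E:/.../20200713.json/20200713.json', encoding='utf-8') as f:
    for line in islice(f, 5):
        print(line[:200])  # print only the start of each line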

gensim file not found error

I am executing the following line:
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
This code is available at "https://radimrehurek.com/gensim/wiki.html". I downloaded the Wikipedia corpus and generated the required files, and wiki_en_wordids.txt is one of those files. This file is available in the following location:
~/gensim/results/wiki_en
So when I execute the code mentioned above, I get the following error:
Traceback (most recent call last):
File "~\Python\Python36-32\temp.py", line 5, in <module>
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
File "~\Python\Python36-32\lib\site-packages\gensim\corpora\dictionary.py", line 344, in load_from_text
with utils.smart_open(fname) as f:
File "~\Python\Python36-32\lib\site-packages\smart_open\smart_open_lib.py", line 129, in smart_open
return file_smart_open(parsed_uri.uri_path, mode)
File "~\Python\Python36-32\lib\site-packages\smart_open\smart_open_lib.py", line 613, in file_smart_open
return open(fname, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'wiki_en_wordids.txt'
Even though the file is available in the required location, I get that error. Should I place the file in some other location? How do I determine what the right location is?
The code needs an absolute path here. A relative path is resolved against the current working directory, and in this case the file name is passed as an argument to a function in another library, so the working directory is not necessarily the directory that contains the file.
One way to handle this situation is to use abspath:
import os
import gensim

id2word = gensim.corpora.Dictionary.load_from_text(os.path.abspath('wiki_en_wordids.txt'))
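If the script isn't being run from the directory that holds the file, another option is to build the full path explicitly. Based on the location given in the question (~/gensim/results/wiki_en), that might look like the following sketch (the directory is assumed from the question; adjust it if your layout differs):
import os
import gensim

# Expand ~ to the home directory and point at the file's actual location.
wordids_path = os.path.expanduser('~/gensim/results/wiki_en/wiki_en_wordids.txt')
id2word = gensim.corpora.Dictionary.load_from_text(wordids_path)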

Python IOError: [Errno 24] Too many open files

I plan to pass a user code script to the Python logger, run it using Python's bdb module, and then log the output to a file.
Here is my code that drives the logger:
try:
    logger._runscript(script_str)
except bdb.BdbQuit:
    pass
finally:
    logger.finalize(filename)
where logger.finalize is defined below as function finalizer(output, filename).
The bdb will spawn a new thread and call the following finalizer function after execution:
def finalizer(output, filename):
    outfile = open(filename, 'a')
    outfile.write(json.dumps(output, indent=4))
    outfile.close()
Here output is the execution result, which we write to the file named by filename.
I tested the three lines in the finalizer function on their own, and they ran okay.
However, when they were called from the logger, I always got the following error message:
IOError: [Errno 24] Too many open files: filename
I only open one file, append a string to its end, then close it. Why are there "too many open files"? Can anyone kindly point me to the problem?
Here is the TraceBack info:
Traceback (most recent call last):
File "./exec.py", line 95, in <module>
File "./exec.py", line 82, in main
File "./exec.py", line 45, in run
File "path to project/logger.py", line 1321, in exec_script_str
File "path to project/logger.py", line 1292, in finalize
File "./exec.py", line 24, in finalizer
IOError: [Errno 24] Too many open files: 'test01.py'
This doesn't look like the logger from the standard library. Rather, it looks like you are trying to use a third-party library that provides a sandbox environment. One of the restrictions is that it forbids file access (see line 1233).
If you are creating the logger object yourself, you can disable these security checks by creating it with the appropriate flag, e.g.:
def exec_str_with_user_ns(script_str, user_ns, finalizer_func):
    logger = PGLogger(False, False, False, finalizer_func, disable_security_checks=True)
    try:
        logger._runscript(script_str, user_ns)
    except bdb.BdbQuit:
        pass
    finally:
        return logger.finalize()
