How to mock a file object in Python 3

In python2 I have this in my test method:
mock_file = MagicMock(spec=file)
I'm moving to python3, and I can't figure out how to do a similar mock. I've tried:
from io import IOBase
mock_file = MagicMock(spec=IOBase)
mock_file = create_autospec(IOBase)
What am I missing?

IOBase does not implement crucial file methods such as read and write, so it is usually unsuitable as a spec for a mocked file object. Depending on whether you want to mock a raw stream, a binary file, or a text file, use RawIOBase, BufferedIOBase, or TextIOBase as the spec instead:
from io import BufferedIOBase
mock_file = MagicMock(spec=BufferedIOBase)
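For instance, a spec'd mock can be driven in a test like this (count_bytes is a hypothetical unit under test, invented for the example):

```python
from io import BufferedIOBase
from unittest.mock import MagicMock

# Hypothetical unit under test: counts the bytes in a binary file object.
def count_bytes(f):
    return len(f.read())

mock_file = MagicMock(spec=BufferedIOBase)
mock_file.read.return_value = b"hello"

print(count_bytes(mock_file))  # 5
mock_file.read.assert_called_once()

# The spec still guards against typos: attributes that real binary
# file objects lack raise AttributeError instead of silently passing.
try:
    mock_file.raed()
except AttributeError:
    print("spec enforced")
```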

Related

Using fastAPI UploadFile with libraries that accept a file-like object

I would like to know the best practice for accessing the underlying file object when uploading a file with FastAPI.
When uploading a file with FastAPI, the object we get is a starlette.datastructures.UploadFile.
We can access the underlying file attribute which is a tempfile.SpooledTemporaryFile.
We can then access the underlying private _file attribute, and pass it to libraries that expect a file-like object.
Below is an example of two routes with python-docx and pdftotext:
@router.post("/open-docx")
async def open_docx(upload_file: UploadFile = File(...)):
    mydoc = docx.Document(upload_file.file._file)
    # do something

@router.post("/open-pdf")
async def open_pdf(upload_file: UploadFile = File(...)):
    mypdf = pdftotext.PDF(upload_file.file._file)
    # do something
However, I don't like the idea of accessing the private _file attribute. Is there a better way to pass the file object without saving it first?
Note: to reproduce the example above, put the following code in launch.py and run uvicorn launch:app --reload:
import docx
import pdftotext
from fastapi import FastAPI, File, UploadFile
app = FastAPI()
@app.post("/open-docx")
async def open_docx(upload_file: UploadFile = File(...)):
    mydoc = docx.Document(upload_file.file._file)
    # do something
    return {"firstparagraph": mydoc.paragraphs[0].text}

@app.post("/open-pdf")
async def open_pdf(upload_file: UploadFile = File(...)):
    mypdf = pdftotext.PDF(upload_file.file._file)
    # do something
    return {"firstpage": mypdf[0]}
See documentation for UploadFile here:
https://fastapi.tiangolo.com/tutorial/request-files/#uploadfile
It says it has an attribute named
file: A SpooledTemporaryFile (a file-like object).
This is the actual Python file that you can pass directly to
other functions or libraries that expect a "file-like" object.
So you don't need to access the private ._file attribute. Just pass upload_file.file.
Note that this is untested. Per the docs you can use the SpooledTemporaryFile object without accessing its underlying wrapper:
The returned object is a file-like object whose _file attribute is either an io.BytesIO or io.TextIOWrapper object (depending on whether binary or text mode was specified) or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.
https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile (Emphasis added).
Thus it would seem you can just do:
@router.post("/open-docx")
async def open_docx(upload_file: UploadFile = File(...)):
    with upload_file.file as f:
        mydoc = docx.Document(f)
Avoid the with if you don't want to close the file.
Edit: Seekable files
SpooledTemporaryFiles are not properly seekable, sadly. See this question. It boils down to the fact that they roll over and land on the disk after a certain point, at which point seeking is harder. Accessing the _file attribute is not safe because of this rollover. Thus you either need to save them to disk, or to read them explicitly into a virtual (ram) file.
If you are on Linux, a workaround might be to save the files to /dev/shm or /tmp, the former of which is in ram, the latter often a ramdisk, and let the OS handle swapping massive files to disk.
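Another way to sidestep the rollover problem, assuming uploads fit comfortably in memory, is to copy the spooled file into a plain io.BytesIO, which is fully seekable (the helper name to_seekable is invented for this sketch):

```python
import io
import tempfile

# Hypothetical helper: copy a SpooledTemporaryFile into a fully seekable
# in-memory buffer. Only sensible when uploads fit in RAM.
def to_seekable(spooled, chunk_size=64 * 1024):
    spooled.seek(0)
    buf = io.BytesIO()
    while True:
        chunk = spooled.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
    buf.seek(0)
    return buf

# Demo with a spooled file that has rolled over to disk:
spooled = tempfile.SpooledTemporaryFile(max_size=4)
spooled.write(b"0123456789")  # exceeds max_size, so it rolls over
seekable = to_seekable(spooled)
seekable.seek(3)
print(seekable.read())  # b'3456789'
```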

How should a NamedTemporaryFile be annotated?

I tried typing.IO as suggested in Type hint for a file or file-like object?, but it doesn't work:
from __future__ import annotations
from tempfile import NamedTemporaryFile
from typing import IO
def example(tmp: IO) -> str:
    print(tmp.file)
    return tmp.name

print(example(NamedTemporaryFile()))
for this, mypy tells me:
test.py:6: error: "IO[Any]" has no attribute "file"; maybe "fileno"?
yet Python runs it fine, so the code itself is OK.
I don't think this can be easily type hinted.
If you check the definition of NamedTemporaryFile, you'll see that it's a function that ends in:
return _TemporaryFileWrapper(file, name, delete)
And _TemporaryFileWrapper is defined as:
class _TemporaryFileWrapper:
Which means there isn't a super-class that can be indicated, and _TemporaryFileWrapper is "module-private". It also doesn't look like it has any members that make it a part of an existing Protocol * (except for Iterable and ContextManager; but you aren't using those methods here).
I think you'll need to use _TemporaryFileWrapper and ignore the warnings:
from tempfile import _TemporaryFileWrapper  # Weak error

def example(tmp: _TemporaryFileWrapper) -> str:
    print(tmp.file)
    return tmp.name
If you really want a clean solution, you could implement your own Protocol that includes the attributes you need, and have it also inherit from Iterable and ContextManager. Then you can type-hint using your custom Protocol.
* It was later pointed out that it does fulfil IO, but the OP requires attributes that aren't in IO, so that can't be used.
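A sketch of such a custom Protocol, assuming the only attributes needed are file and name (the name NamedFileLike is invented here; Protocol requires Python 3.8+):

```python
from __future__ import annotations

from tempfile import NamedTemporaryFile
from typing import IO, Any, Protocol

# Hypothetical Protocol covering just the attributes the example touches;
# _TemporaryFileWrapper satisfies it structurally.
class NamedFileLike(Protocol):
    name: Any

    @property
    def file(self) -> IO: ...

def example(tmp: NamedFileLike) -> str:
    print(tmp.file)
    return tmp.name

with NamedTemporaryFile() as tmp:
    print(example(tmp))
```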

How to open an already opened file in python?

I'm currently using a library that loads a file using open(filename).
I don't want to mess with the file system, so I tried to download this file in memory using BytesIO:
obj = BytesIO(requests.get(url).content)
But, if I pass the obj to the library, I'll get an error.
How can I transform my object so it could be "opened" by open(object)?
You can override the built-in open function to return the first argument directly if the argument is a file-like object (which can be identified if it has a read attribute):
import builtins
original_open = open
builtins.open = lambda f, *args, **kwargs: f if hasattr(f, 'read') else original_open(f, *args, **kwargs)
so that:
from io import BytesIO
print(open(BytesIO(b'hello world'), 'rb').read())
outputs:
b'hello world'
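If permanently replacing builtins.open feels too invasive, the same passthrough can be scoped with unittest.mock.patch so that open is only overridden around the library call (library_load below is a stand-in for the third-party code, not a real API):

```python
import io
from unittest import mock

real_open = open

def open_or_passthrough(f, *args, **kwargs):
    # Return file-like objects untouched; defer real paths to open().
    if hasattr(f, 'read'):
        return f
    return real_open(f, *args, **kwargs)

def library_load(source):
    # Stand-in for third-party code that insists on calling open() itself.
    fh = open(source)
    return fh.read()

with mock.patch('builtins.open', open_or_passthrough):
    data = library_load(io.BytesIO(b'hello world'))
print(data)  # b'hello world'
```

Once the with block exits, the real open() is restored automatically.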
You can't, unless you save it as a file first, because open() only works on paths in the file system. Instead, check out the Python docs on io streams (https://docs.python.org/3/library/io.html) to learn how to access your data through io methods.

Implement Custom Str or Buffer in Python

I'm working with python-gnupg to decrypt a file and the decrypted file content is very large so loading the entire contents into memory is not feasible.
I would like to short-circuit the write method in order to manipulate the decrypted contents as they are written.
Here are some failed attempts:
import gpg
from StringIO import StringIO
# works but not feasible due to memory limitations
decrypted_data = gpg_client.decrypt_file(decrypted_data)
# works but no access to the buffer write method
gpg_client.decrypt_file(decrypted_data, output=buffer())
# fails with TypeError: coercing to Unicode: need string or buffer, instance found
class TestBuffer:
    def __init__(self):
        self.buffer = StringIO()
    def write(self, data):
        print('writing')
        self.buffer.write(data)

gpg_client.decrypt_file(decrypted_data, output=TestBuffer())
Can anyone think of any other ideas that would allow me to create a file-like str or buffer object to output the data to?
You can implement a subclass of one of the classes in the io module described in I/O Base Classes, presumably io.BufferedIOBase. The standard library contains an example of something quite similar in the form of the zipfile.ZipExtFile class. At least this way, you won't have to implement complex functions like readline yourself.
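A minimal sketch of that idea, assuming the gpg client accepts any writable file-like object for output (ChunkProcessor and its callback are invented names; ZipExtFile-style read buffering is omitted for brevity):

```python
import io

# Minimal writable stream: process each decrypted chunk as it arrives
# instead of accumulating the full plaintext in memory.
class ChunkProcessor(io.RawIOBase):
    def __init__(self, callback):
        super().__init__()
        self.callback = callback  # invoked once per chunk written

    def writable(self):
        return True

    def write(self, data):
        self.callback(bytes(data))
        return len(data)

# Demo: collect chunks with a plain list.
chunks = []
out = ChunkProcessor(chunks.append)
out.write(b'first ')
out.write(b'second')
print(b''.join(chunks))  # b'first second'
```

The instance would then be passed as gpg_client.decrypt_file(encrypted_data, output=out), assuming decrypt_file only calls write() on the output object.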

Wrapping urllib3.HTTPResponse in io.TextIOWrapper

I use AWS boto3 library which returns me an instance of urllib3.response.HTTPResponse. That response is a subclass of io.IOBase and hence behaves as a binary file. Its read() method returns bytes instances.
Now, I need to decode csv data from a file received in such a way. I want my code to work on both py2 and py3 with minimal code overhead, so I use backports.csv which relies on io.IOBase objects as input rather than on py2's file() objects.
The first problem is that HTTPResponse yields bytes data for CSV file, and I have csv.reader which expects str data.
>>> import io
>>> from backports import csv # actually try..catch statement here
>>> from mymodule import get_file
>>> f = get_file() # returns instance of urllib3.HTTPResponse
>>> r = csv.reader(f)
>>> list(r)
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I tried to wrap HTTPResponse with io.TextIOWrapper and got the error 'HTTPResponse' object has no attribute 'read1'. This is expected because TextIOWrapper is intended to be used with BufferedIOBase objects, not IOBase objects. It only happens on Python 2's implementation of TextIOWrapper, because it always expects the underlying object to have read1 (source), while Python 3's implementation checks for read1 and falls back to read gracefully (source).
>>> f = get_file()
>>> tw = io.TextIOWrapper(f)
>>> list(csv.reader(tw))
AttributeError: 'HTTPResponse' object has no attribute 'read1'
Then I tried to wrap HTTPResponse with io.BufferedReader and then with io.TextIOWrapper. And I got the following error:
>>> f = get_file()
>>> br = io.BufferedReader(f)
>>> tw = io.TextIOWrapper(br)
>>> list(csv.reader(f))
ValueError: I/O operation on closed file.
After some investigation it turns out that the error only happens when the file doesn't end with \n. If it does end with \n then the problem does not happen and everything works fine.
There is some additional logic for closing underlying object in HTTPResponse (source) which is seemingly causing the problem.
The question is: how can I write my code to
work on both python2 and python3, preferably with no try..catch or version-dependent branching;
properly handle CSV files represented as HTTPResponse regardless of whether they end with \n or not?
One possible solution would be to make a custom wrapper around TextIOWrapper which would make read() return b'' when the object is closed instead of raising ValueError. But is there any better solution, without such hacks?
Looks like this is an interface mismatch between urllib3.HTTPResponse and file objects. It is described in this urllib3 issue #1305.
For now there is no fix, hence I used the following wrapper code which seemingly works fine:
class ResponseWrapper(io.IOBase):
    """
    This is the wrapper around urllib3.HTTPResponse
    to work around issue shazow/urllib3#1305.
    Here we decouple HTTPResponse's "closed" status from ours.
    """
    # FIXME drop this wrapper after shazow/urllib3#1305 is fixed

    def __init__(self, resp):
        self._resp = resp

    def close(self):
        self._resp.close()
        super(ResponseWrapper, self).close()

    def readable(self):
        return True

    def read(self, amt=None):
        if self._resp.closed:
            return b''
        return self._resp.read(amt)

    def readinto(self, b):
        val = self.read(len(b))
        if not val:
            return 0
        b[:len(val)] = val
        return len(val)
And use it as follows:
>>> f = get_file()
>>> r = csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(f))))
>>> list(r)
A similar fix was proposed by the urllib3 maintainers in the bug report comments, but it would be a breaking change, so for now things will probably not change and I have to use the wrapper (or do some monkey patching, which is probably worse).
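The wrapper pattern can be checked without real network traffic by substituting a minimal stand-in for urllib3.HTTPResponse (FakeResponse below is invented for the demo, and the CSV data deliberately lacks a trailing newline):

```python
import csv
import io

class ResponseWrapper(io.IOBase):
    # Same wrapper as above, repeated so this demo is self-contained.
    def __init__(self, resp):
        self._resp = resp

    def readable(self):
        return True

    def read(self, amt=None):
        if self._resp.closed:
            return b''
        return self._resp.read(amt)

    def readinto(self, b):
        val = self.read(len(b))
        if not val:
            return 0
        b[:len(val)] = val
        return len(val)

class FakeResponse:
    # Stand-in for urllib3.HTTPResponse: binary read() plus a closed flag.
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.closed = False

    def read(self, amt=None):
        return self._buf.read(-1 if amt is None else amt)

    def close(self):
        self.closed = True

f = FakeResponse(b'a,b\r\n1,2')  # no trailing newline on purpose
rows = list(csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(f)))))
print(rows)  # [['a', 'b'], ['1', '2']]
```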
