I am using discord.py, which requires async/await functions.
I want to dump and load data using the pickle and JSON modules,
but when I try that I get this error:
AttributeError: __enter__
d:\Users\-------\visual studio projects\------------\main.py:65: RuntimeWarning:
coroutine 'Command.__call__' was never awaited
I believe this happened because I am opening the file inside an async function.
So I tried an alternative way to open the files inside async functions, with aiofiles:
async with aiofiles.open("owners.pkl", mode="rb") as file:
    owner_dict = pickle.load(file)
But the problem is that pickle and json do not work with these file objects inside async functions.
Is there any alternative way to open, load, and dump with JSON or pickle inside async/await functions?
The thing returned by aiofiles.open is not a regular file-like object; its operations need to be awaited:
async with aiofiles.open("owners.pkl", mode="rb") as file:
    owner_dict = pickle.loads(await file.read())
Then again, this doesn't really help all that much since the deserialization will still happen in a blocking way (only reading the file will be async).
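If the blocking deserialization itself is a concern, one option (a minimal sketch, not from the original answer) is to push the whole read-plus-unpickle step into a worker thread so neither part blocks the event loop:

import asyncio
import pickle

def _load_owners():
    # Ordinary blocking code; runs in a worker thread.
    with open("owners.pkl", "rb") as f:
        return pickle.load(f)

async def load_owners():
    # asyncio.to_thread needs Python 3.9+; on older versions use
    # loop.run_in_executor(None, _load_owners) instead.
    return await asyncio.to_thread(_load_owners)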
And a general note: Even if some interface demands an async function, there's no restriction on what happens inside of it. You can just write async in front of a normal, blocking function, and it will just work (without the benefits of async/await of course).
Is there some way to "capture" all attempted writes to a particular file /my/special/file, and instead write that output to a BytesIO or StringIO object, or some other way to get that output without actually writing to disk?
The use case is: there's a 'handler' function, whose contract is that it should write its output to /my/special/file. I don't have any control over this handler function -- I don't write it, I don't know its contents and I can't change its contents, and the contract cannot change. I'd like to be able to do something like this:
# 'output' has whatever 'handler' has written to `/my/special/file`
output = handler.run(data)
Even if this is an odd request, I'd like to be able to do this, even with a 'hackier' answer.
EDIT: my code (and handler) will be invoked many times on a lot of chunks of data, so performance (both latency and throughput) are important.
Thanks.
If you're talking about code in your own Python program, you could monkey-patch the built-in open function before that code gets called. Here's a really stupid example, but it shows that you can do this. This causes code that thinks it's writing to a file to instead write into an in-memory buffer. The calling code then prints what the foreign code wrote to the file:
import io
# The function you don't have access to that writes to a file
def foo():
    f = open("/tmp/foo", "w")
    f.write("blahblahblah\n")
    f.close()

# The buffer to contain the captured text
capture_buffer = ""

# My silly file-like object that only handles write(str) and close()
class MyFileClass:
    def write(self, str):
        global capture_buffer
        capture_buffer += str

    def close(self):
        pass

# patch open to return a MyFileClass instance
def my_open2(*args, **kwargs):
    return MyFileClass()

open = my_open2
# Call the target function
foo()
# Print what the function wrote to "the file"
print(capture_buffer)
Result:
blahblahblah
Sorry for not spending more time with this. Just showing you it's possible. As others say, a mocking module might be the way to go to not have to grow your own thing here. I don't know if they allow access to what is written. I guess they must. Such a module is just going to do a better job of what I've shown here.
If your program does other file IO with open, or whichever method the mystery code uses to open the file, you'd check the incoming path and only return your special object if it was the one path you're interested in. Otherwise, you could just call the original open, which you could stash away under another name.
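A slightly more targeted sketch of that idea (the helper names here are hypothetical, and it assumes the handler opens the file in text mode; use io.BytesIO for binary mode): patch builtins.open, intercept only the one path, and hand everything else to the real open:

import builtins
import io

_real_open = builtins.open
captured = []   # collects everything written to the special path

class _CaptureFile(io.StringIO):
    def close(self):
        # Grab the contents before the buffer is discarded.
        captured.append(self.getvalue())
        super().close()

def _patched_open(path, *args, **kwargs):
    # Intercept only the special path; everything else behaves normally.
    if path == "/my/special/file":
        return _CaptureFile()
    return _real_open(path, *args, **kwargs)

builtins.open = _patched_open
try:
    handler.run(data)            # handler thinks it wrote to /my/special/file
finally:
    builtins.open = _real_open   # always restore the real open

output = "".join(captured)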
From the docs:
write(data)
Write data to the stream.
This method is not subject to flow control. Calls to write() should be followed by drain().
coroutine drain()
Wait until it is appropriate to resume writing to the stream. Example:
writer.write(data)
await writer.drain()
From what I understand:
(1) You need to call drain every time write is called.
(2) If not, I guess write will block the loop thread.
Then why is write not a coroutine that calls it automatically? Why would one call write without having to drain? I can think of two cases:
(1) You want to write and close immediately.
(2) You have to buffer some data before the message is complete.
The first one is a special case; I think we can have a different API for it. Buffering should be handled inside the write function and the application should not care.
Let me put the question differently. What is the drawback of doing this? Does the Python 3.8 version effectively do this?
async def awrite(writer, data):
    writer.write(data)
    await writer.drain()
Note: drain doc explicitly states the below:
When there is nothing to wait for, the drain() returns immediately.
Reading the answer and links again, I think the functions work like this. Note: Check accepted answer for more accurate version.
def write(data):
    remaining = socket.try_write(data)
    if remaining:
        _pendingbuffer.append(remaining)  # buffer will keep growing if the other side is slow and we have a lot of data

async def drain():
    if len(_pendingbuffer) < BUF_LIMIT:
        return
    await wait_until_other_side_is_up_to_speed()
    assert len(_pendingbuffer) < BUF_LIMIT

async def awrite(writer, data):
    writer.write(data)
    await writer.drain()
So when to use what:
When the data is not continuous, like responding to an HTTP request: we just need to send some data, we don't care when it arrives, and memory is not a concern - just use write.
Same as above, but memory is a concern - use awrite.
When streaming data to a large number of clients (e.g. some live stream or a huge file): if the data is duplicated in each connection's buffer, it will definitely overflow RAM. In this case, write a loop that takes a chunk of data each iteration and calls awrite. For a huge file, loop.sendfile is better if available.
From what I understand, (1) You need to call drain every time write is called. (2) If not I guess, write will block the loop thread
Neither is correct, but the confusion is quite understandable. The way write() works is as follows:
A call to write() just stashes the data to a buffer, leaving it to the event loop to actually write it out at a later time, and without further intervention by the program. As far as the application is concerned, the data is written in the background as fast as the other side is capable of receiving it. In other words, each write() will schedule its data to be transferred using as many OS-level writes as it takes, with those writes issued when the corresponding file descriptor is actually writable. All this happens automatically, even without ever awaiting drain().
write() is not a coroutine, and it absolutely never blocks the event loop.
The second property sounds convenient - you can call write() wherever you need to, even from a function that's not async def - but it's actually a major flaw of write(). Writing as exposed by the streams API is completely decoupled from the OS accepting the data, so if you write data faster than your network peer can read it, the internal buffer will keep growing and you'll have a memory leak on your hands. drain() fixes that problem: awaiting it pauses the coroutine if the write buffer has grown too large, and resumes it again once the os.write()'s performed in the background are successful and the buffer shrinks.
You don't need to await drain() after every write, but you do need to await it occasionally, typically between iterations of a loop in which write() is invoked. For example:
while True:
    response = await peer1.readline()
    peer2.write(b'<response>')
    peer2.write(response)
    peer2.write(b'</response>')
    await peer2.drain()
drain() returns immediately if the amount of pending unwritten data is small. If the data exceeds a high threshold, drain() will suspend the calling coroutine until the amount of pending unwritten data drops beneath a low threshold. The pause will cause the coroutine to stop reading from peer1, which will in turn cause the peer to slow down the rate at which it sends us data. This kind of feedback is referred to as back-pressure.
Buffering should be handled inside write function and application should not care.
That is pretty much how write() works now - it does handle buffering and it lets the application not care, for better or worse. Also see this answer for additional info.
Addressing the edited part of the question:
Reading the answer and links again, I think the functions work like this.
write() is still a bit smarter than that. It won't try to write only once, it will actually arrange for data to continue to be written until there is no data left to write. This will happen even if you never await drain() - the only thing the application must do is let the event loop run its course for long enough to write everything out.
A more correct pseudo code of write and drain might look like this:
class ToyWriter:
    def __init__(self):
        self._buf = bytearray()
        self._empty = asyncio.Event()
        self._empty.set()

    def write(self, data):
        self._buf.extend(data)
        loop.add_writer(self._fd, self._do_write)
        self._empty.clear()

    def _do_write(self):
        # Automatically invoked by the event loop when the
        # file descriptor is writable, regardless of whether
        # anyone calls drain()
        while self._buf:
            try:
                nwritten = os.write(self._fd, self._buf)
            except OSError as e:
                if e.errno == errno.EWOULDBLOCK:
                    return  # continue once we're writable again
                raise
            self._buf = self._buf[nwritten:]
        self._empty.set()
        loop.remove_writer(self._fd, self._do_write)

    async def drain(self):
        if len(self._buf) > 64*1024:
            await self._empty.wait()
The actual implementation is more complicated because:
it's written on top of a Twisted-style transport/protocol layer with its own sophisticated flow control, not on top of os.write;
drain() doesn't really wait until the buffer is empty, but until it reaches a low watermark;
exceptions other than EWOULDBLOCK raised in _do_write are stored and re-raised in drain().
The last point is another good reason to call drain() - to actually notice that the peer is gone by the fact that writing to it is failing.
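As a small, hedged illustration of that last point: errors stored by the background writes surface when you await drain(), which is how you find out the peer has gone away (ConnectionResetError is a typical example, not the only possibility):

try:
    writer.write(payload)
    await writer.drain()
except ConnectionResetError:
    ...  # the peer disconnected; stop writing and clean up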
For writing large files to GridFS using put(), is it necessary to use a context manager (with)?
Looking at the documentation for put() here, calling put() is equivalent to doing:
try:
    f = new_file(**kwargs)
    f.write(data)
finally:
    f.close()
Does that mean opening and closing the file are done automatically, so there is no need to do it explicitly?
gridfs.GridFS.put isn't a context manager. It doesn't define __enter__ and __exit__ methods of the context management protocol.
Using it directly without some modification as a context manager will result in an AttributeError.
Using gridfs.GridFS.put as-is saves you a few lines of code and, more importantly, saves you from having to manage opening and closing the GridFile yourself.
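A minimal usage sketch (assuming pymongo is installed and a local MongoDB instance; the database and file names are illustrative):

import gridfs
from pymongo import MongoClient

db = MongoClient().my_database
fs = gridfs.GridFS(db)

# put() creates the GridIn file, writes the data, and closes it for you,
# so no with statement is needed on the caller's side.
with open("large_file.bin", "rb") as local_file:
    file_id = fs.put(local_file, filename="large_file.bin")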
For file I/O what is the purpose of:
with open
and should I use it instead of:
f = open('file', 'w')
f.write('foo')
f.close()
Always use the with statement.
From docs:
It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks.
If you don't close the file explicitly, the file object may hang around in memory until it is garbage collected, which implicitly calls close() on it. So it's better to use the with statement, as it will close the file promptly even if an error occurs.
Related: Does a File Object Automatically Close when its Reference Count Hits Zero?
Yes. You should use with whenever possible.
This is using the return value of open as a context manager. Thus with is used not just specifically for open, but it should be preferred in any case that some cleanup needs to occur with regards to the object (that you would normally put in a finally block). In this case: on exiting the context, the .close() method of the file object is invoked.
Another good example of a context manager "cleaning up" is threading's Lock:
from threading import Lock

lock = Lock()

with lock:
    ...  # do thing
# lock is released outside the context
In this case, the context manager is .release()-ing the lock.
Anything with an __enter__ and __exit__ method can be used as a context manager. Or, better, you can use contextlib to make context managers with the @contextmanager decorator. More here.
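For example, a small sketch of a context manager written with contextlib (the opened name is just illustrative):

from contextlib import contextmanager

@contextmanager
def opened(path, mode="r"):
    f = open(path, mode)
    try:
        yield f          # the value bound by "as" in the with statement
    finally:
        f.close()        # runs on normal exit and when an exception is raised

# with opened("file", "w") as f:
#     f.write("foo")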
Basically what it is trying to avoid is this:
set things up
try:
    do something
finally:
    tear things down
but with the with statement you can safely, say open a file and as soon as you exit the scope of the with statement the file will be closed.
The with statement calls the __enter__ function of a class, which does your initial set up and it makes sure it calls the __exit__ function at the end, which makes sure that everything is closed properly.
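To make that concrete, here is a minimal sketch of a class implementing the two methods (the class name is illustrative, not from the standard library):

class ManagedFile:
    def __init__(self, path, mode):
        self._path = path
        self._mode = mode

    def __enter__(self):
        self._f = open(self._path, self._mode)
        return self._f      # bound by "as" in the with statement

    def __exit__(self, exc_type, exc_value, traceback):
        self._f.close()     # always runs, even if an exception was raised
        return False        # do not suppress the exception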
The with statement is a shortcut for easily writing more robust code. This:
with open('file', 'w') as f:
    f.write('foo')
is equivalent to this:
try:
    f = open('file', 'w')
    f.write('foo')
finally:
    f.close()
I'm implementing a simple upload handler in Python which reads an uploaded file in chunks into memory, GZips and signs them, and reuploads them to another server for long term storage. I've already devised a way to read the upload in chunks with my web server, and essentially I have a workflow like this:
class MyUploadHandler:
    def on_file_started(self, file_name):
        pass

    def on_file_chunk(self, chunk):
        pass

    def on_file_finished(self, file_size):
        pass
This part works great.
Now I need to upload the file in chunks to the final destination after performing my modifications to them. I'm looking for a workflow somewhat like this:
import requests

class MyUploadHandler:
    def on_file_started(self, file_name):
        self.request = requests.put(
            "http://secondaryuploadlocation.com/upload/%s" % (file_name,),
            streaming_upload=True)

    def on_file_chunk(self, chunk):
        self.request.write_body(transform_chunk(chunk))

    def on_file_finished(self, file_size):
        self.request.finish()
Is there a way to do this using the Python requests library? It seems that they allow for file-like upload objects which can be read, but I'm not sure exactly what that means and how to apply it for my situation. How can I provide a streaming upload request like this?
I would suggest using the multiprocessing module of Python. You can use the apply_async routine in that module to upload each chunk as it is completed, without affecting the other uploads. You can then put them in a temporary folder and, after the upload completes, stitch them together.
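A rough sketch of that suggestion (upload_chunk, the part-numbered URL, and the chunks list are placeholders, not an existing API):

from multiprocessing import Pool
import requests

chunks = []  # the transformed chunks produced by the handler callbacks

def upload_chunk(index, chunk):
    # Placeholder: PUT one transformed chunk to the storage server.
    requests.put("http://secondaryuploadlocation.com/upload/part-%d" % index,
                 data=chunk)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = [pool.apply_async(upload_chunk, (i, c))
                   for i, c in enumerate(chunks)]
        for r in results:
            r.get()   # wait for every chunk upload to finish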
The following answer to a similar question should solve your problem:
Q: "How to stream POST data into Python requests?"
A: Example code using queue, threading and iter() with sentinel
https://stackoverflow.com/a/40018547/19163
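A condensed sketch of that approach (the class and method names here are illustrative, not taken from the linked answer): feed chunks into a queue, and let requests iterate over it as a chunked request body from a background thread:

import queue
import threading
import requests

class StreamingPut:
    """Pushes chunks into a queue that a background requests.put drains."""

    _SENTINEL = object()

    def __init__(self, url):
        self._q = queue.Queue(maxsize=10)
        # iter(callable, sentinel) keeps yielding chunks until the sentinel
        # arrives; requests sends an iterator body with chunked transfer
        # encoding. The response object is discarded in this sketch.
        self._thread = threading.Thread(
            target=requests.put,
            args=(url,),
            kwargs={"data": iter(self._q.get, self._SENTINEL)})
        self._thread.start()

    def write(self, chunk):
        self._q.put(chunk)

    def finish(self):
        self._q.put(self._SENTINEL)
        self._thread.join()

# Mapping onto the handler callbacks from the question:
#   on_file_started:  self.stream = StreamingPut(url)
#   on_file_chunk:    self.stream.write(transform_chunk(chunk))
#   on_file_finished: self.stream.finish()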