EC2 instance very slow after EBS volume resize - python

I have a t2.medium instance on EC2 that had a 75 GB gp2 volume (a general-purpose SSD). After changing to a 110 GB gp2 volume, the whole machine is really slow.
My Python script used to take something like 40 to 60 seconds to uncompress some zip files, and now it's taking 3 to 5 minutes.
If a multithreaded version of this script is running, it takes forever.
Any idea what happened or how to solve it?
The instance runs Windows.

When you "resized" the disk volume what you really did was create a new larger EBS volume from a snapshot of the old volume. The new EBS volume becomes available immediately but you have to go through an "initialization" process to get it to load all the data. The first time you access a particular block of data on the new volume it will be slow. Subsequent attempts to access that block of data will occur at the fast speed that you would expect. You can read more about this here.

Related

Long-running Python program RAM usage

I am currently working on a project where a Python program is supposed to run for several days, essentially in an endless loop until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to the time series DB (see image below).
[Image: tracemalloc output for one day of runtime]
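For reference, a minimal sketch of the kind of interval logging described above (the interval and the print output are illustrative, not the project's actual code):

import time
import tracemalloc

tracemalloc.start()
while True:
    current, peak = tracemalloc.get_traced_memory()  # bytes allocated since start(), and the peak
    print(f"current={current / 1e6:.1f} MB  peak={peak / 1e6:.1f} MB")
    time.sleep(60)  # log once per minute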
To me it looks like the memory the program needs does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program looking for growing data structures?
Thank you very much in advance!
Okay, it turns out the answer is: no, this is not proper behaviour; the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the InfluxDB v2 client.
You need to close both the write_api (implicitly done with the "with ... as write_api:" statement) and the client itself (explicitly done via client.close() in the example below).
In my previous version, which had the increasing memory usage, I only closed the write_api and not the client.
import time
import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS

# (snippet from a method; self.url, self.token, etc. are the instance's connection settings)
client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
client.close()

PyTorch model takes too long to load the first time on a new machine

I have a manual scaling set-up on EC2 where I create instances from an AMI that already runs my code at boot (using systemd). I'm facing a fundamental problem: on the main instance (the one I use to create the AMI), the Python code takes 8 seconds to be ready after the image is booted; this includes importing libraries, loading the state dicts of the models, etc. On the instances I create from the AMI, the code takes 5+ minutes to boot up the first time; it takes especially long to load the state dicts from disk to GPU memory. After the first time, the code takes about the same as on the main instance.
The AMI keeps the same __pycache__ folders as the main instance, so it shouldn't take that much time, since I think the AMI should include everything, shouldn't it? So my question is: is there any other caching to make CUDA / Python faster that I'm not taking into consideration? I'm only keeping the __pycache__/ folders, but I don't know if there's anything else I could do to make sure it doesn't take that long to boot everything the first time. This is my main structure:
# Import libraries
import torch
import numpy as np
# Import personal models (takes 1 minute)
from model1 import model1
from model2 import model2
# Instantiate models
model1_object = model1()
model2_object = model2()
# Load state dicts (takes 3+ minutes, the first time in new instances, seconds other times)
# Note: the models are a bit heavy
model1_object.load_state_dict(torch.load("model1.pth"))
model2_object.load_state_dict(torch.load("model2.pth"))
Note: I'm using g4dn.xlarge instances in AWS, for both the main instance and the new ones.
This is caused by the high latency of restored AWS EBS snapshots. When you first restore a snapshot, latency is extremely high, which explains why the models take so long to load when the instance is freshly created.
Check the initialization section of this article: https://cloudonaut.io/ebs-snapshot-pitfalls/
The only solution I've found to make an instance fast as soon as it is created is to enable Fast Snapshot Restore, which costs around $500 a month: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-fast-snapshot-restore.html
If you have time to spare, you can wait until maximum performance is reached, or try to warm the volume up beforehand: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html
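As a lighter-weight variant of that warm-up (my suggestion, not from the linked docs): if the checkpoints are the slow part, you can pre-read just those files at boot so the slow first fetch happens before the service needs them:

for path in ("model1.pth", "model2.pth"):  # the checkpoint files from the question
    with open(path, "rb") as f:
        while f.read(1024 * 1024):  # the first read of each block is the slow one
            pass  # discard; we only want the blocks pulled from the snapshot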

How to find the time period with minimum usage over a day from stored data

I'm working on a project where I monitor the pressure inside a pipe once every 5 seconds throughout the day and store it either in memory on the MCU or in the cloud (this still has to be decided). After 24 hours I need to find, from that data, the time period with the minimum usage. I would appreciate suggestions for a smart algorithm that I can read up on and apply. I have little knowledge of databases, so I will appreciate your views.
We are going to use Python and I am going to code on the Raspberry Pi.
PS: I am a noob at database algorithms, so please keep that in mind.
Here is a process flow which should give you the minimum-use period and also give you a data record.
[Image: process flow diagram]
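As one concrete algorithm (my suggestion, not necessarily the flow in the diagram above): a sliding-window sum finds the fixed-length window with the lowest total usage in a single pass. With one sample every 5 seconds, a day has 17,280 samples, so a 1-hour window is 720 samples:

def min_usage_window(samples, window=720):
    """Return (start_index, total) of the window with the minimum summed usage."""
    window_sum = sum(samples[:window])  # total of the first window
    best_sum, best_start = window_sum, 0
    for i in range(window, len(samples)):
        window_sum += samples[i] - samples[i - window]  # slide the window by one sample
        if window_sum < best_sum:
            best_sum, best_start = window_sum, i - window + 1
    return best_start, best_sum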

How can Python test the availability of a NAS drive really quickly?

For a flood warning system, my Raspberry Pi rings the bells in near-real-time but uses a NAS drive as a postbox: it outputs data files to a PC for slower-time graphing and reporting, and receives various input data files back. Python on the Pi takes precisely 10 seconds to establish that the NAS drive right next to it is not currently available. I need that to happen in less than a second for each access attempt; otherwise the delays add up and the Pi fails to refresh the hardware watchdog in time. (The Pi performs tasks on a fixed cycle: every second, every 15 seconds (watchdog), every 75 seconds and every 10 minutes.) All disc access attempts are preceded by tests with try-except, but that doesn't help: tests like os.path.exists() or with open() both take 10 seconds before raising the exception, even when the NAS drive is powered down. It's as though there's a 10-second timeout way down in the comms protocol rather than up in the software.
Is there a way of telling try-except not to be so patient? If not, how can I get a more immediate indicator of whether the NAS drive is going to hold up the Pi at the next read/write, so that the Pi can give up and wait till the next cycle? I've done all the file queueing for that, but it's wasted if every check takes 10 seconds.
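One quick probe that avoids the filesystem layer entirely (my suggestion, not from the thread; the host and port are assumptions, 445 being the usual SMB port): try a TCP connection to the NAS with a short timeout, and only attempt file access if it succeeds:

import socket

def nas_reachable(host="192.168.1.50", port=445, timeout=0.5):
    """Return True if the NAS accepts a TCP connection within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable host
        return False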
Taking on MQTT at this stage would be a big change to this nearly-finished project. But your suggestion of decoupling the near-real-time Python from the NAS drive by using a second script is, I think, the way to go. If the Python disc-interface commands wait 10 seconds for an answer, I can't help that. But I can stop them holding up the time-critical Python functions by keeping all time-critical file accesses local in Pi memory, and replicating whole files in both directions between the Pi and the NAS drive whenever they change. In fact, I already have the opportunistic replicator code in Python; I just need to move it out of the main time-critical script into a separate script that does the replication. That replicator script will then do any waiting, rather than the time-critical script, and the Pi scheduler will decouple the two for me. Thanks for your help - I was beginning to despair!

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am becoming really confident that once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange, because I was told that one reason I needed to structure my file access the way I have is so that the handle and file contents are flushed. But that can't be right.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer, it takes roughly 6 minutes to read the files and determine whether anything is returned from the regular expression.
However, if I run the code a second time, it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the files it read in the first iteration.
I also want to observe that the first time I do this I can hear the disk spin (the files are on E:\; Python is on C:). E: is just a spinning disk with a 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I run it later, I do not hear the disk spin.
Here is the code
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)
no7a = []
for path in path_list:  # path_list: the ~9,568 file paths, built elsewhere
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re, string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons. One is that I was thinking about using multiprocessing, but if the bottleneck is reading in the files, I don't see that I will gain much. I also think this is a problem because I would worry about a file being modified and not having the most recent version available.
I am tagging this 2.7 because I have no idea whether this behavior persists across versions.
To confirm this behavior I modified my code to run as a .py file and added some timing code. I then rebooted my computer: the first time it ran it took 5.6 minutes, and the second time (without rebooting) it took 36 seconds. The output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer), it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time. This is amazing behavior to me, but it seems dangerous.
To be clear, the results are the same. Given the timing tests I have run and the fact that I do not hear the disk spinning, I believe the files are somehow still accessible to Python.
This is caused by file caching in Windows. It is not related to Python.
To stop Windows from caching your reads, you can:
Disable the paging file in Windows and fill the RAM up to 90%.
Use some tool to disable file caching in Windows, like this one.
Run your code in a Linux VM on your Windows machine with limited RAM; in Linux you can control the caching much better (see the sketch after this list).
Make the files much bigger, so that they won't fit in the cache.
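For the Linux-VM option, a hedged sketch of dropping the Linux page cache between timing runs (requires root; equivalent to running `sync; echo 3 > /proc/sys/vm/drop_caches` in a shell):

import os

os.sync()  # flush dirty pages first so the drop is safe
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = drop page cache plus dentries and inodes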
I fail to see why this is a problem. I'm not 100% certain how Windows handles file-cache invalidation, but unless the last-modified time changes, you, I, and Windows would all assume that the file still holds the same content. And if the file holds the same content, I don't see why reading from the cache would be a problem.
I'm pretty sure that if you change the last-modified date, say by opening the file for write access and then closing it right away, Windows will hold sufficient doubt about the file content to invalidate the cache.
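A minimal sketch of that suggestion, using os.utime to bump the last-modified time explicitly rather than opening for write (whether Windows then invalidates its cached pages is the conjecture above; I can't confirm it):

import os

for path in path_list:  # path_list as in the question's code
    os.utime(path.strip(), None)  # set access/modified time to "now"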
