How to mock S3 data in Python unit tests, update that data, and assert on it - python

I have written a function that reads data from an S3 bucket and dumps it into an in-memory container (the data includes a timestamp), then checks the difference between the current time and the pod/container time and sets some data on the container accordingly.
I want to write a test case that mocks the S3 data containing the timestamp, modifies that timestamp, and then calls the function described above so that I can assert on the resulting values.
I have tried mocking the data but couldn't get it working, and I'm not sure whether I need to mock the time as well.
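One common approach (shown below as a minimal sketch, not the original code) is to patch the S3 read with unittest.mock and pin the clock with freezegun; the module name myapp and the helpers read_s3_object and process_s3_data are hypothetical stand-ins for the real function.

import json
import unittest
from unittest.mock import patch

from freezegun import freeze_time  # assumes freezegun is installed

import myapp  # hypothetical module containing the function under test


class ProcessS3DataTest(unittest.TestCase):
    @freeze_time("2024-01-02 12:00:00")     # pin "current time"
    @patch("myapp.read_s3_object")          # replace the real S3 read
    def test_stale_timestamp_is_detected(self, mock_read):
        # Fake S3 payload whose timestamp is one hour in the past
        mock_read.return_value = json.dumps(
            {"timestamp": "2024-01-02 11:00:00", "value": 42}
        )

        result = myapp.process_s3_data("my-bucket", "my-key")

        # Assert on whatever the function stores in the in-memory container;
        # the keys below are hypothetical.
        self.assertEqual(result["value"], 42)
        self.assertTrue(result["is_stale"])


if __name__ == "__main__":
    unittest.main()

If the code reads the clock via time.time() or datetime.now(), freezegun covers both; otherwise patch whatever clock source the function actually uses.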

Related

Azure ML File Dataset mount() is slow & downloads data twice

I have created a File Dataset using the Azure ML Python API. The data in question is a set of parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Gen 2, spread across multiple partitions. I then tried to mount the dataset on an AML compute instance. During the mounting process, I observed that each parquet file was downloaded twice under the /tmp directory of the compute instance, with the following message printed in the console logs:
Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet
This log message gets printed for each parquet file which is part of the dataset.
Also, the process of mounting the dataset is very slow: 44 minutes for ~10K parquet files, each about 330 KB in size.
The %%time magic in JupyterLab suggests most of the time was spent on the I/O process:
CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s
Note: Both the Data Lake Gen 2 and Azure ML compute instance are under the same virtual network.
Here are my questions:
How to avoid downloading the parquet file twice?
How to make the mounting process faster?
I have gone through this thread, but the discussion there didn't reach a conclusion.
The Python code I have used is as follows:
import pandas as pd
from azureml.core import Dataset

data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)
mount_context = None
try:
    mount_context = dataset.mount(path_to_mount)
    # Mount the file stream
    mount_context.start()
except Exception as ex:
    raise ex
df = pd.read_parquet(path_to_mount)
The robust option is to download directly from the AzureBlobDatastore. You need to know the datastore and relative path, which you can get by printing the dataset description. Namely:
import tempfile

import pandas as pd
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
dstore = ws.datastores.get(dstore_name)
target = (dstore, dstore_path)

with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df = pd.read_parquet(tmpdir)
The convenient option is to stream tabular datasets. Note that you don't control how the file is read (Microsoft converters may occasionally not work as you expect). Here is the template:
ds = Dataset.Tabular.from_parquet_files(target)
df = ds.to_pandas_dataframe()
I have executed a bunch of tests to compare the performance of FileDataset.mount() and FileDataset.download(). In my environment, download() is much faster than mount().
download() works well when the disk size of the compute is large enough to fit all the files. However, in a multi-node environment, the same data (in my case parquet files) gets downloaded to each of the nodes (multiple copies). As per the documentation:
If your script processes all the files in your dataset and the disk on
your compute resource is large enough for the dataset, the download
access mode is the better choice. The download access mode will avoid
the overhead of streaming the data at runtime. If your script accesses
a subset of the dataset or it's too large for your compute, use the
mount access mode.
Downloading data in a multi-node environment could trigger performance issues (link). In such a case, mount() might be preferred.
I have tried with TabularDataset as well. As Maciej S has mentioned, with TabularDataset the user doesn't need to decide how data is read from the datastore (i.e. there is no mount vs. download access mode to choose). But with the current implementation (azureml-core 1.38.0) of TabularDataset, the compute needs more memory (RAM) than FileDataset.download() does for an identical set of parquet files. It looks like the current implementation first reads each individual parquet file into a pandas DataFrame (held in memory/RAM) and then appends them into the single DataFrame handed to the API user. The higher memory requirement probably stems from this "eager" nature of the API.

Why is appengine memcache not storing my data for the requested period of time?

I have my employees stored in App Engine ndb and I'm running a cron job via the task queue to generate a list of dictionaries containing the email address of each employee. The resulting list looks something like this:
[{"text":"john#mycompany.com"},{"text":"mary#mycompany.com"},{"text":"paul#mycompany.com"}]
The list is used as source data for various Angular components such as ngTags, ngAutocomplete, etc. I want to store the list in memcache so the Angular HTTP calls will run faster.
The problem I'm having is that the values stored in memcache never last more than a few minutes, even though I've set them to last 26 hours. I'm aware that a stored value cannot be over 1 MB, so as an experiment I hardcoded the list of employees to contain only three values, and the problem still persists.
The App Engine console tells me the job ran successfully, and if I run the job manually it will load the values into memcache, but they'll only stay there for a few minutes. I've done this many times before with far greater amounts of data, so I can't understand what's going wrong. I have billing enabled and I'm not over quota.
Here is an example of the function used to load the data into memcache:
def update_employee_list():
    try:
        # Get all 3000+ employees and generate a list of dictionaries
        fresh_emp_list = [{"text":"john@mycompany.com"},{"text":"mary@mycompany.com"},{"text":"paul@mycompany.com"}]
        the_cache_key = 'my_emp_list'
        emp_data = memcache.get(the_cache_key)
        # Kill the memcache packet so we can rebuild it.
        if emp_data is not None:
            memcache.delete(the_cache_key)
        # Rebuild the memcache packet
        memcache.add(the_cache_key, fresh_emp_list, 93600)  # this should last for 26 hours
    except Exception as e:
        logging.info('ERROR!!!...A failure occurred while trying to set up the memcache packet: %s' % e.message)
        raise deferred.PermanentTaskFailure()
Here is an example of the function the Angular components use to get the data from memcache:
@route
def get_emails(self):
    self.meta.change_view('json')
    emp_emails = memcache.get('my_emp_list')
    if emp_emails is not None:
        self.context['data'] = emp_emails
    else:
        self.context['data'] = []
Here is an example of the cron setting in cron.yaml:
- url: /cron/lookups/update_employee_list
description: Daily rebuild of Employee Data
schedule: every day 06:00
timezone: America/New_York
Why can't appengine memcache hold on to a list of three dictionaries for more than a few minutes?
Any ideas are appreciated. Thanks
Unless you are using dedicated memcache (a paid service), the cached values can and will be evicted at any time.
The lifetime you specify tells memcache when your value becomes invalid and can therefore be removed. It does not guarantee that your value will stay in memcache that long; it is just a cap on the value's maximum lifetime.
Note: the more you put in memcache, the more likely it is that other values will get dropped. Therefore you should carefully consider what data you put in your cache; you should definitely not put every value you come across into memcache.
On a side note: in the projects I recently worked on, we had a de facto maximum cache lifetime of about a day. No cache value ever lasted longer than that, even if the desired lifetime was much higher. Interestingly enough, the cache got cleared out at about the same time every day, even for very new values.
Thus: never rely on memcache. Always use persistent storage, and use memcache for performance boosts with high-volume traffic.
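A minimal sketch of that fallback pattern for the employee list above (Employee is a hypothetical ndb model with an email property; the memcache/ndb calls are the standard App Engine Python 2 APIs):

from google.appengine.api import memcache
from google.appengine.ext import ndb

EMP_CACHE_KEY = 'my_emp_list'  # same key the cron job writes


class Employee(ndb.Model):     # hypothetical model
    email = ndb.StringProperty()


def get_employee_list():
    """Read-through cache: try memcache first, fall back to the datastore."""
    emp_list = memcache.get(EMP_CACHE_KEY)
    if emp_list is not None:
        return emp_list

    # Cache miss (eviction, flush, new instance, ...): rebuild from ndb.
    emp_list = [{'text': e.email} for e in Employee.query().fetch()]

    # Best-effort repopulation; the datastore copy stays canonical.
    memcache.add(EMP_CACHE_KEY, emp_list, 93600)
    return emp_list

This way an eviction only costs one datastore query instead of returning an empty list to the Angular components.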

How to keep track of previous run times for a python script

I have a basic Python program that accepts the path to a SQL script and a date (or list of dates) as command-line arguments. The program executes the query in the SQL script for each provided date. I'd like to log the execution time of the SQL query each time I run the program, so I can use this data to estimate the execution time based on how many dates the user provides.
How can I store this information (query, date provided, execution time) in a way that the program can easily access later? To clarify, I already know how to time the execution of each query; I just don't know how, or in what format, to store this information.
Example:
$ python myscript /dir/of/script.sql -dates 20140101 20140102
script.sql 20140101 1m22s
script.sql 20140102 1m53s
I'm looking for a way to store this output information for many different sql scripts (which are located in different directories) and over a large number of executions (dates). I need to do this in such a way that I can get the information back into the program to estimate execution times for a given sql script given the execution times of its previous runs.
If the data doesn't have any value outside of an "installed" instance of the program, then store it locally. If you want other users of the Python script to get estimates based on the run times of others' executions (which might actually be a valid metric, since you're really timing the server here), you could store it in some sort of metrics-collection table within the database.
In the latter case, I would probably just keep all the collected timings in memory and save them to the database after all the scripts have run.
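A minimal sketch of the local option, using a SQLite file as the log (the file, table, and column names here are made up for illustration):

import sqlite3

DB_PATH = 'query_timings.db'  # hypothetical local log file


def _connect():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS timings ('
        ' script TEXT, run_date TEXT, seconds REAL)'
    )
    return conn


def record_timing(script_path, run_date, seconds):
    with _connect() as conn:
        conn.execute(
            'INSERT INTO timings (script, run_date, seconds) VALUES (?, ?, ?)',
            (script_path, run_date, seconds),
        )


def estimate_seconds(script_path, num_dates):
    """Estimate total run time as (average past seconds per date) * num_dates."""
    with _connect() as conn:
        row = conn.execute(
            'SELECT AVG(seconds) FROM timings WHERE script = ?', (script_path,)
        ).fetchone()
    avg = row[0] if row and row[0] is not None else 0.0
    return avg * num_dates

In the main program you would wrap each query with time.monotonic() calls and pass the elapsed seconds plus the script path and date to record_timing(); before a run, estimate_seconds(script, len(dates)) gives a rough forecast.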

aws python boto: looking for reliable way to interrupt get_contents_to_filename

I have a python function that downloads a file from S3 to some temp location on a local drive and then processes it. The download part looks like this:
def processNewDataFile(key):
    ## templocation below is just some temp local path
    key.get_contents_to_filename(templocation)
    ## further processing
Here key is the AWS key for the file to download. What I've noticed is that occasionally get_contents_to_filename seems to freeze. In other parts of my code I have a solution that interrupts a block of code (and raises an exception) if the block does not complete in a specified amount of time. That solution is hard to use here, since the files I need to download vary a lot in size and S3 sometimes responds more slowly than at other times.
So is there any reliable way of interrupting/timing out get_contents_to_filename that does NOT involve a hard predetermined time limit?
thanks
You could use a callback function with get_contents_to_filename
http://boto.cloudhackers.com/en/latest/ref/gs.html#boto.gs.key.Key.get_contents_to_file
The callback function takes two parameters: the number of bytes transmitted so far and the total size of the file.
You can also specify the granularity (the maximum number of times the callback will be called), although I've only used it with small files (less than 10 KB) and it usually only gets called twice: once at the start and once at the end.
The important thing is that the total size of the file is passed to the callback at the start of the transfer, so you could start a timer based on the size of the file.
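A sketch of that idea is below: the callback derives a minimum acceptable transfer rate and raises if the download falls behind it, instead of using one fixed hard limit. The rate threshold is an assumption, raising from inside the callback is expected to propagate out of get_contents_to_filename but hasn't been verified against every boto version, and a connection that stops calling the callback entirely won't be caught this way.

import time


class StalledDownload(Exception):
    pass


def make_progress_watchdog(min_bytes_per_sec=50 * 1024, grace_seconds=30):
    """Return a boto progress callback that aborts a transfer which is
    running slower than min_bytes_per_sec after an initial grace period."""
    state = {'start': None}

    def progress(bytes_transmitted, total_size):
        now = time.time()
        if state['start'] is None:
            state['start'] = now
            return
        elapsed = now - state['start']
        expected = min_bytes_per_sec * elapsed  # hypothetical minimum pace
        if elapsed > grace_seconds and bytes_transmitted < expected:
            raise StalledDownload(
                'only %d of %d bytes after %.0f seconds'
                % (bytes_transmitted, total_size, elapsed))

    return progress


# Usage (key is the boto Key from the question):
#   key.get_contents_to_filename(templocation,
#                                cb=make_progress_watchdog(), num_cb=100)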

Using cProfile or line_profile on a Python script in /cgi-bin/?

Is there a way to run cProfile or line_profile on a script on a server?
i.e., how could I get the results from one of those two profilers for http://www.Example.com/cgi-bin/myScript.py?
Thanks!
Not sure what line_profile is. For cProfile, you just need to direct the results to a file you can later read on the server (depending on what kind of access you have to the server).
To quote the example from the docs,
import cProfile
cProfile.run('foo()', 'fooprof')
and put all the rest of the code into a def foo(): -- then later retrieve that fooprof file and analyze it at leisure (assuming your script runs with permissions to write it in the first place, of course).
Of course you can ensure different runs get profiled into different files, etc, etc -- whether this is practical also depends on what kind of access and permissions you're getting from your hosting provider, i.e., how are you allowed to persist data, in a way that lets you retrieve that data later? That's not a question of Python, it's a question of contracts between you and your hosting provider;-).
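A sketch of what that might look like inside the CGI script, writing each run's profile to its own timestamped file (the profile directory and file naming are assumptions; use whatever location your host lets you write to):

#!/usr/bin/env python
import cProfile
import os
import time

PROFILE_DIR = '/tmp/profiles'  # hypothetical writable location


def foo():
    # ... the real CGI work: emit headers, build the page, etc.
    print('Content-Type: text/plain\n')
    print('hello from the profiled script')


if __name__ == '__main__':
    if not os.path.isdir(PROFILE_DIR):
        os.makedirs(PROFILE_DIR)
    # One profile file per request so separate runs don't overwrite each other.
    outfile = os.path.join(PROFILE_DIR,
                           'myScript-%d.prof' % int(time.time() * 1000))
    cProfile.run('foo()', outfile)

Afterwards, fetch the .prof files and inspect them locally, for example with python -m pstats myScript-<timestamp>.prof.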
