I'm writing a service that needs to register custom AMIs in each EC2 region based on a qcow2 image file.
I have been exploring the apache-libcloud and boto libraries, but it seems the AMI registration functions are built to create an AMI based on a running instance, and I want to base the AMI on my qcow2 image file.
If there isn't an easy solution to this problem, I'll take a complex one. If for some reason this is impossible with a qcow2 image file, I also have access to the RAW image files.
I've succeeded in doing this programmatically. My solution uses raw image files, since they are the ones that can be written directly to a disk. If you need to convert from qcow2 image files, you can do it manually with qemu-img, or script the conversion with a small Python wrapper (a sketch follows).
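If you want to drive that conversion from Python rather than by hand, a minimal sketch looks like the following. It assumes qemu-img is on the PATH; the file names and the helper name are made up for illustration.

```python
import subprocess

def qcow2_to_raw(qcow2_path, raw_path):
    """Convert a qcow2 image to a raw image by shelling out to qemu-img."""
    subprocess.check_call([
        "qemu-img", "convert",
        "-f", "qcow2",   # input format
        "-O", "raw",     # output format
        qcow2_path, raw_path,
    ])

qcow2_to_raw("fedora-cloud.qcow2", "fedora-cloud.raw")
```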
An outline of my process for AMI registration based on a raw image file:
Select an AMI and corresponding AKI to use as a "utility instance". It doesn't have to be the same operating system as the image you're attempting to register. If the AMI has requiretty enabled in /etc/sudoers, you need to make sure that you request a pseudo-terminal when attempting to SSH into the node, such as with Paramiko's Channel.get_pty() method.
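For reference, requesting a pseudo-terminal with Paramiko looks roughly like this; the hostname, username, and key path are placeholders for your own values.

```python
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("ec2-xx-xx-xx-xx.compute-1.amazonaws.com",
               username="ec2-user",
               key_filename="/path/to/utility-key.pem")

# Open a channel and request a PTY so sudo works even with requiretty enabled.
channel = client.get_transport().open_session()
channel.get_pty()
channel.exec_command("sudo whoami")
print(channel.recv(1024))
```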
Spin up a utility instance based on the AMI and AKI selected. It must be EBS optimized (m1.large size instances work well with EBS) and should have a secondary EBS volume attached that is large enough for the entire uncompressed image you want to register. I use /dev/sdb for this device name.
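The exact apache-libcloud calls for this step vary by version, so here is a rough equivalent using boto3 purely for illustration; the AMI, AKI, and key names are placeholders, not values from the original answer.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch the utility instance with a secondary EBS volume on /dev/sdb that is
# big enough for the uncompressed image and survives termination.
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # utility AMI (placeholder)
    KernelId="aki-xxxxxxxx",     # matching pv-grub AKI (placeholder)
    InstanceType="m1.large",
    KeyName="utility-key",
    MinCount=1, MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sdb",
        "Ebs": {"VolumeSize": 20, "DeleteOnTermination": False},
    }],
)
instance_id = resp["Instances"][0]["InstanceId"]
```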
Once the utility instance is accessible via SSH, have it write the raw image file to the secondary volume. Personally, I pull a .raw.xz file from the Internet that is the image I want to write, so my utility command is sudo sh -c 'curl RAW_XZ_URL | xzcat > /dev/xvdb'. Note that in all of my experience, /dev/sdX devices are accessed as /dev/xvdX on the actual instance, but this might not be the case everywhere.
Once the utility command completes, you can destroy the utility node, assuming that you've made your /dev/sdb volume not delete upon node termination. If you haven't, just stop the node. If executing the utility command programmatically, you can use Paramiko's Channel.recv_exit_status() method to wait until the command completes, and then check for a 0 exit status indicating success.
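Putting the utility command and the exit-status check together with Paramiko might look like this, reusing the client from the earlier sketch; the image URL is a placeholder.

```python
# Run the write command on the utility instance and block until it finishes.
channel = client.get_transport().open_session()
channel.get_pty()
channel.exec_command(
    "sudo sh -c 'curl https://example.com/image.raw.xz | xzcat > /dev/xvdb'"
)
status = channel.recv_exit_status()   # blocks until the remote command exits
if status != 0:
    raise RuntimeError("image write failed with exit status %d" % status)
```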
Once the utility instance is no longer running, take a snapshot of the /dev/sdb volume.
Once the snapshot completes, you can register it as an AMI. Make sure to use the same AKI that you've been using this whole time, as well as the proper root device name (I use full disk images, so my root device name is /dev/sda rather than /dev/sda1). Amazon suggests you use hd0 pv-grub AKIs nowadays, not hd00.
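Again using boto3 for illustration rather than the libcloud calls the answer relies on, the snapshot-and-register step could look like this; the volume ID, AKI, and image name are placeholders.

```python
# Snapshot the secondary volume and register the snapshot as an AMI.
snap = ec2.create_snapshot(VolumeId="vol-xxxxxxxx",
                           Description="raw image written by utility instance")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

ami = ec2.register_image(
    Name="my-custom-image",
    Architecture="x86_64",
    VirtualizationType="paravirtual",
    KernelId="aki-xxxxxxxx",        # same pv-grub (hd0) AKI used for the utility instance
    RootDeviceName="/dev/sda",      # full-disk image, so /dev/sda rather than /dev/sda1
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda",
        "Ebs": {"SnapshotId": snap["SnapshotId"]},
    }],
)
print(ami["ImageId"])
```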
One way this can all be accomplished is through the apache-libcloud and paramiko Python libraries, both pip-installable. A good example is the Fedimg library, which implements this exact method in order to automatically register new AMIs in all EC2 regions as Fedora cloud image builds complete.
When actually implementing this process, there is quite a bit of timing, exception-handling, and other "gotchas" involved. This is simply an outline of the steps one must take to resolve the challenge via my method.
Related
I am trying to make an application using Python that registers students' attendance. I'm planning to use my laptop's built-in fingerprint reader to identify the students and register the attendance.
I've tried some web searches but I couldn't find any way to use built-in fingerprint devices for applications with Python. Do you know any way to do it?
The laptop I want to use for fingerprints is a Lenovo ThinkPad L540.
I managed to find some things like the Windows Biometric Framework, but those seem to be meant to be used from other languages.
https://learn.microsoft.com/en-us/windows/win32/secbiomet/biometric-service-api-portal?redirectedfrom=MSDN
This cannot be done for now. The fingerprint sensor built into a laptop or phone can be used for authentication purposes only. That is, you can enroll any number of fingerprints that are eligible to access the device, and the device will then allow any one of them to unlock it. It will not record whose fingerprint it is; it will only report whether a fingerprint is authenticated or not.
For recording attendance, you must go with a time-and-attendance system. If you want to build a software-based attendance system around a scanner, then you have to use a dedicated fingerprint scanner such as the MFS100 or ZK7500.
From what I can tell, this absolutely can be done. The following link is for a Python wrapper around the Windows Biometric Framework. It is around 4 years old, but the functionality it offers still seems to work fine.
https://github.com/luspock/FingerPrint
The identify function in this wrapper prints out the Sub Factor value whenever someone places a matching finger on the scanner. In my experimentation, the returned Sub Factor is unique to each finger that is stored. On the first day you use this, you would just fill a dictionary with sub factors and student names, and that is everything you need for your use case.
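For the mapping itself, nothing fancy is needed; something along these lines works (the names and sub factor values here are made up):

```python
# Map the sub factor reported by the scanner to a student name.
students = {
    1: "Alice Perera",
    2: "Bimal Silva",
    3: "Chamari Fernando",
}

def record_attendance(sub_factor, attendance_log):
    """Look up the student by sub factor and mark them present."""
    name = students.get(sub_factor)
    if name is None:
        print("Unknown fingerprint, sub factor:", sub_factor)
        return
    attendance_log.add(name)
    print(name, "marked present")

present_today = set()
record_attendance(2, present_today)   # prints "Bimal Silva marked present"
```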
Considering that this wrapper only makes use of the system biometric unit pool, the drawback here is that you have to add all of your students' fingers to your PC through the Windows sign-in options, meaning they would be able to unlock it. If you are okay with that, it seems like this will suit your needs.
It would also be possible for you to disable login with fingerprint and only use the system pool for this particular use case. That would give you what you want and keep your PC safe from anyone that has their fingerprint stored in the system pool.
If you want to make use of a private pool, you would have to add that functionality to the wrapper yourself. That's totally possible, but it would be a lot of work.
One thing to note about the Windows Biometric Framework is that it requires the process calling the function to have focus. In order for me to test the wrapper, I used the command-line through the Windows Console Host. Windows Terminal doesn't work, because it doesn't properly acquire focus. You can also use tkinter and call the functions with a button.
I'm fairly new to Django. I am creating an application where I am posting images from a Flutter application to a Django REST API. I need to run a Python script with the input being the image that gets posted to the API.
Does anyone have any idea about this?
The best way to handle this is a job management system (e.g. Slurm, Torque, or Oracle Grid Engine): you can create and submit a job for every uploaded image, send the response back to the user right away, and let the job management system process the image independently of the request. Celery can also work if the job won't take much time.
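If you go the Celery route, a minimal sketch of a task plus the call from the Django view could look like this; the broker URL and the task body are placeholders, not part of the original answer.

```python
# tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def process_image(image_path):
    # Run your existing Python processing script here.
    ...

# In the DRF view, after saving the uploaded image to disk:
#   process_image.delay(saved_path)
```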
A simple implementation that scales well:
Upload the images into a directory uploaded
Have your script run as a daemon (controlled by systemd) looking for new files in the directory uploaded
whenever it finds a new file, it moves (mv) it into a directory working (that way, you can run multiple instances of your script in parallel to scale up)
once your script is done with the image, it moves it into a directory finished (or wherever you need the finished images).
That setup is very simple, and works both on a small one-machine setup with low traffic and on multi-machine setups with dedicated storage and multiple worker machines that handle the image transform jobs.
It also decouples your image processing from your web backend.
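A minimal sketch of that daemon loop, assuming the three directories above live on the same filesystem (so the rename is atomic) and with process() standing in for your actual script:

```python
import os
import shutil
import time

UPLOADED = "uploaded"
WORKING = "working"
FINISHED = "finished"

def process(path):
    # Placeholder for whatever your image-processing script actually does.
    pass

while True:
    for name in os.listdir(UPLOADED):
        src = os.path.join(UPLOADED, name)
        dst = os.path.join(WORKING, name)
        try:
            # On the same filesystem this rename is atomic, so two worker
            # instances can never grab the same file.
            os.rename(src, dst)
        except OSError:
            continue  # another worker got there first
        process(dst)
        shutil.move(dst, os.path.join(FINISHED, name))
    time.sleep(1)
```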
I'm currently working on a website where I want the user to upload one or more images; my Flask backend will make some changes to these pictures and then return them to the front end.
Where do I optimally save these images temporarily, especially if there is more than one user on my website at the same time (I'm planning on containerizing the website)? Is it safe for me to save the images in the folder of the website, or do I need e.g. a database for that?
You should use a database, or external object storage like Amazon S3.
I say this for a couple of reasons:
Accidents do happen. Say the client does an HTTP POST, gets a URL back, and does an HTTP GET to retrieve the result. But in the meantime, the container restarts (because the system crashed; your cloud instance got terminated; you restarted the container to upgrade its image; the application failed); the container-temporary filesystem will get lost.
A worker can run in a separate container. It's very reasonable to structure this application as a front-end Web server, that pushes messages into a job queue, and then a back-end worker picks up messages out of that queue to process the images. The main server and the worker will have separate container-local filesystems.
You might want to scale up the parts of this. You can easily run multiple containers from the same image; they'll each have separate container-local filesystems, and you won't directly control which replica a request goes to, so every container needs access to the same underlying storage.
...and it might not be on the same host. In particular, cluster technologies like Kubernetes or Docker Swarm make it reasonably straightforward to run container-based applications spread across multiple systems; sharing files between hosts isn't straightforward, even in these environments. (Most of the Kubernetes Volume types that are easy to get aren't usable across multiple hosts, unless you set up a separate NFS server.)
That set of constraints would imply trying to avoid even named volumes as much as you can. It makes sense to use volumes for the underlying storage for your database, and it can make sense to use Docker bind mounts to inject configuration files or get log files out, but ideally your container doesn't really use its local filesystem at all and doesn't care how many copies of itself are running.
(Do not rely on Docker's behavior of populating a named volume on first use. There are three big problems with it: it is on first use only, so if you update the underlying image, the volume won't get updated; it only works with Docker named volumes and not other options like bind-mounts; and it only works in Docker proper and not in Kubernetes.)
Other decisions are possible given other sets of constraints. If you're absolutely sure you will never ever want to run this application spread across multiple nodes, Docker volumes or bind mounts might make sense. I'd still avoid the container-temporary filesystem.
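For the S3 route specifically, a rough sketch of the upload/download side with Flask and boto3 might look like this; the bucket name, key scheme, and process() function are placeholders for your own setup.

```python
import io
import uuid

import boto3
from flask import Flask, request, send_file

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-image-bucket"   # placeholder


def process(raw_bytes):
    # Placeholder: return the transformed image bytes.
    return raw_bytes


@app.route("/images", methods=["POST"])
def upload():
    f = request.files["image"]
    key = "processed/%s.png" % uuid.uuid4()
    data = process(f.read())                     # your existing transformation
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return {"key": key}, 201                     # client fetches the result by key


@app.route("/images/<path:key>")
def download(key):
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return send_file(io.BytesIO(obj["Body"].read()), mimetype="image/png")
```

Because the images live in S3 rather than on the container filesystem, any replica of the web container (or a separate worker container) can serve or process them.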
One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or system crash, the file upload happens very late. So I want to create a Cloud Function that can trigger my Python BigQuery load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks enough description of the desired use case and any issues the OP has faced. However, here are a few possible approaches that you might choose from depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
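A Storage-triggered background function boils down to something like the sketch below. The function name, dataset, and table are placeholders, and the load configuration is only a guess at what your BigQuery load script does.

```python
# main.py -- deployed with something like:
#   gcloud functions deploy load_to_bq --runtime python39 \
#       --trigger-resource sale_bucket \
#       --trigger-event google.storage.object.finalize
from google.cloud import bigquery

def load_to_bq(event, context):
    """Triggered whenever an object is finalized (uploaded) in the bucket."""
    uri = "gs://%s/%s" % (event["bucket"], event["name"])
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    # Load the just-uploaded CSV straight into BigQuery and wait for the job.
    client.load_table_from_uri(
        uri, "my_dataset.seller_data", job_config=job_config
    ).result()
```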
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this check of whether the file exists, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function, where the request could come from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function.
Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it and then does something with the result.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you will not have cold starts or risk the request timing out before the job is done.
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run will scale down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
For all the approaches above, some additional things you might want to do are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is the directory both GAE and CF use for storing temporary files. Cloud Run is a bit different here, but let's not go deep into it, as it's overkill by itself.
However, keep in mind that if your file is large you might run into high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., as it makes sure the file doesn't stay open.
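A sketch of that download-and-clean-up pattern with the google-cloud-storage client; the helper name is hypothetical and the processing step is just a comment.

```python
import os

from google.cloud import storage

def handle_object(bucket_name, blob_name):
    """Download an object to /tmp, process it, and always clean up."""
    local_path = os.path.join("/tmp", os.path.basename(blob_name))
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.download_to_filename(local_path)
    try:
        with open(local_path) as fh:   # 'with' guarantees the file gets closed
            data = fh.read()           # ...do your actual processing here
    finally:
        os.remove(local_path)          # ALWAYS clean /tmp afterwards
```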
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (while always paying close attention to memory usage) is this: upon creation of the object I upload to the bucket, I get the current time and use a regex to turn it into a name like results_22_6.
What happens now is that when I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket listing. If yes, do nothing; if not, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's kinda preferable.
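Since Cloud Storage lists objects in lexicographic order by name, names like seller_data_{date} (with an ISO-style date) already sort chronologically, so "the latest object" can be found with a short listing helper like the one sketched below; the function name and prefix are illustrative only.

```python
from google.cloud import storage

def latest_object(bucket_name, prefix="seller_data_"):
    """Return the name of the lexicographically last object under the prefix."""
    client = storage.Client()
    # list_blobs returns objects in lexicographic order, so with date-stamped
    # names the last one is the most recent.
    blobs = list(client.list_blobs(bucket_name, prefix=prefix))
    return blobs[-1].name if blobs else None
```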
I'm working on a site that collects textual input from a number of users and then gives all that input to a user with another role after a set amount of time. The user who is given access to all the input needs to be able to export all the text as a word document.
Right now, on my local machine, a button on the page makes a db call for all the text and uses the fs npm module to write the correct set of input to a raw text document in a format the Python script can understand. I then use the docx module in Python to read the text and write the formatted input into the Word document, saving it into the public directory on my server. I can navigate to it manually after that.
I can automate it locally by writing a simple cron job that waits for the contents of the raw text file to change, firing the Python program when that happens and having the link to the Word doc appear after some timeout.
My question is how would I get this to work on my Heroku site? Simply having Python isn't enough, because I need to install the docx module with pip. Beyond that, I still need a scheduled check for the raw text file to change to fire the Python script. Can this be accomplished through the Procfile or some Heroku add-ons? Is there a better way to accomplish the desired behavior of button click -> document creation -> serve the file? Love to know your thoughts.
You have a few different issues to look at: 1) enabling both Python and Node, 2) correct use of the filesystem on Heroku, and 3) ways to schedule the work.
For #1, you need to enable multiple build packs to get both Node.js and Python stacks in place. See https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app.
For #2, you need to send the files to a storage service of some kind (e.g., Amazon S3) - the filesystem for your dyno is ephemeral, and anything written there will disappear after a dyno restart (which happens every 24 hours no matter what).
For #3, the simplest solution is probably the Heroku Scheduler add-on, which acts like a rudimentary cron. Remember, you don't have low-level OS access, so you need to use the Heroku-provided equivalent.
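A rough sketch of the Python side that a Scheduler job could run, using python-docx plus boto3; the fetch_submissions() stub, bucket name, and key are placeholders you would replace with your own database query and storage layout.

```python
# generate_doc.py -- run by Heroku Scheduler, e.g. "python generate_doc.py"
import io

import boto3
from docx import Document

def fetch_submissions():
    # Placeholder: pull the collected text out of your database here.
    return ["First student's input", "Second student's input"]

def main():
    doc = Document()
    doc.add_heading("Collected input", level=1)
    for entry in fetch_submissions():
        doc.add_paragraph(entry)

    buf = io.BytesIO()
    doc.save(buf)            # python-docx can save to a file-like object
    buf.seek(0)

    # Dyno filesystems are ephemeral, so push the result to S3 instead of
    # writing it into the app's public directory.
    s3 = boto3.client("s3")
    s3.upload_fileobj(buf, "my-app-docs", "exports/latest.docx")

if __name__ == "__main__":
    main()
```

Your Node front end can then link to the S3 object (or a presigned URL for it) instead of a file on the dyno.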