Context
I'm working on a Data Science Project in which I'm running a data analysis task on a dataset (let's call it original dataset) and creating a processed dataset (let's call this one result). The last one can be queried by a user by creating different plots through use of a Dash application. The system also makes some predictions on an attribute of this dataset thanks to ML models. Everything will work on an external VM of my company.
What is my current "code"
Currently I have these Python scripts; all of them except Dashboard.py contribute to creating the result dataset:
concat.py (simply concatenates some files)
merger.py (merges different files in the project directory)
processer1.py (processes the first file needed for the analysis)
processer2.py (processes a second file needed for the analysis)
Dashboard.py (the Dash application)
ML.py (runs a classic ML task, creates a report and an updated result dataset with some predictions)
What I should obtain
I'm interested in creating this kind of solution, which will run on the VM:
Dashboard.py runs 24/7 based on the existence of the "result" dataset, without it it's useless.
Every time there's a change in the project directory (new files every month are added), the system triggers the execution of concat.py, merger.py, processer1.py and processer2.py. Maybe a python script and the watchdog package can help to create this trigger mechanism? I'm not sure.
Once the execution above is done, the ML.py file is executed based on the "result" dataset and it's uploaded to the dashboard.
Dashboard.py is restarted with the new CSV file.
I would like some help understanding which technologies I need to achieve this. An example or a source would help, so I can fully understand and apply the right approach. I suspect I need a Python script to orchestrate the whole system, which may or may not be the same script that observes the directory.
The most important thing is that the dashboard is always running, which is why things need to run simultaneously. It should only be restarted once the "result" CSV dataset has been completed and uploaded; I think service continuity is best for the users.
The users will feed the dashboard by adding new files to the observed directory. The automation needs "triggers" to execute the code, since they are not skilled users and (I suppose) will not be allowed to use the shell on the VM. Alternatively, I could schedule a recurring execution instead, for example every month.
My company won't grant me another VM even if one is needed, so I have to do everything on a single VM.
Premise
This is the first time I have to put something "in production", and I have no experience at all. Could anyone help me find the best approach? Thanks in advance.
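The trigger mechanism described above could be sketched with the standard library alone, as follows. This is a minimal sketch, not a production setup: it polls the directory instead of using watchdog events, the script list and file names are assumptions taken from the description, and it assumes each script can be run on its own with the project directory as working directory.

```python
import os
import subprocess
import sys
import time
from pathlib import Path

# Assumed order of the pipeline scripts, taken from the description above.
PIPELINE = ["concat.py", "merger.py", "processer1.py", "processer2.py", "ML.py"]


def snapshot(directory):
    """Map each file name in `directory` to (mtime_ns, size)."""
    result = {}
    for path in Path(directory).iterdir():
        if path.is_file():
            stat = path.stat()
            result[path.name] = (stat.st_mtime_ns, stat.st_size)
    return result


def run_pipeline(scripts, cwd):
    """Run the scripts one after another; raise at the first failure."""
    for script in scripts:
        subprocess.run([sys.executable, script], cwd=cwd, check=True)


def publish(tmp_result, live_result):
    """Swap the new result in with a single rename, so readers never see a half-written file."""
    os.replace(tmp_result, live_result)


def watch(directory, scripts, interval=60.0):
    """Poll `directory` forever and rerun the pipeline whenever its contents change."""
    seen = snapshot(directory)
    while True:
        time.sleep(interval)
        if snapshot(directory) != seen:
            run_pipeline(scripts, cwd=directory)
            # Take a fresh snapshot, since the pipeline may write files itself.
            seen = snapshot(directory)
```

Calling watch("path/to/project", PIPELINE) in a separate process would leave the dashboard process untouched while the pipeline runs. If the pipeline writes the result to a temporary name and then calls publish, the dashboard could re-read the CSV when its modification time changes instead of being restarted; whether that fits depends on how Dashboard.py loads its data. The watchdog package would replace the polling loop with event callbacks, and a monthly scheduled job would replace it with a single run_pipeline call.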
Related
I am writing a very simple ETL(T) pipeline. Currently it does the following:
look at the FTP server to see if new CSV files exist
if yes, then download them
some initial transformations
bulk insert the individual CSVs into an MS SQL DB
some additional transformations
There can be a lot of CSV files. The script runs OK for the moment, but I have no concept of how to actually create a "management" layer around this. Currently my pipeline runs linearly: I have a list of the filenames that need to be loaded, and (in a loop) I load them into the DB.
If something fails, the whole pipeline has to be rerun. I do not manage the state of the pipeline (i.e. has a specific file already been downloaded and transformed/changed?).
There is no way to start from an intermediate point. How could I break this down into individual tasks that need to be performed?
I roughly know of tools like Airflow, but I feel that this is only a part of the necessary tooling, and frankly I am too uneducated in this area to even ask the right questions.
It would be really nice if somebody could point me in the right direction regarding what I am missing and what tools are available.
Thanks in advance
I'm actually using Airflow to run ETL pipelines with steps similar to the ones you describe.
The whole workflow can be partitioned into single tasks. For almost every task Airflow provides an operator.
For
look at the FTP server to see if new CSV files exist
you could use a file sensor with an underlying FTP connection:
FileSensor
For
if yes, then download them
you could use the BranchPythonOperator:
BranchPythonOperator
All succeeding tasks could each be wrapped into a Python function and then executed via the PythonOperator.
Would definitely recommend using Airflow, but if you are looking for alternatives, there are plenty:
airflow-alternatives
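Independently of the orchestrator, the state tracking the question asks about (which file has already passed which step) can be illustrated with the standard library. The following is only a sketch under assumptions: the step names, the table layout and the handler callables are hypothetical, and a scheduler such as Airflow keeps comparable per-task state for you.

```python
import sqlite3

# Hypothetical step names; each step is run once per file.
STEPS = ("download", "transform", "load")


def open_state(path="pipeline_state.db"):
    """Open (or create) the small database that remembers finished steps."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done ("
        "filename TEXT, step TEXT, PRIMARY KEY (filename, step))"
    )
    return conn


def is_done(conn, filename, step):
    """Return True if `step` has already been recorded for `filename`."""
    row = conn.execute(
        "SELECT 1 FROM done WHERE filename = ? AND step = ?", (filename, step)
    ).fetchone()
    return row is not None


def mark_done(conn, filename, step):
    """Record that `step` finished for `filename`."""
    conn.execute("INSERT OR IGNORE INTO done VALUES (?, ?)", (filename, step))
    conn.commit()


def process(conn, filenames, handlers):
    """Run every step for every file, skipping steps recorded as finished.

    `handlers` maps a step name to a callable that takes the filename.
    If a handler raises, the steps completed so far stay recorded,
    so the next call resumes at the failed step instead of starting over.
    """
    for filename in filenames:
        for step in STEPS:
            if is_done(conn, filename, step):
                continue
            handlers[step](filename)
            mark_done(conn, filename, step)
```

With this shape, a failure while transforming one file does not force the other files to be downloaded and loaded again; rerunning process picks up at the first unfinished (file, step) pair.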
I'm relatively new to Python, so I was wondering if anyone can give some hints or tips on something I want to do with Python as part of a build on a Jenkins Pipeline.
To give a basic breakdown, I want to export/save timestamps from the Jenkins output (which currently timestamps all commands/strings that occur in it) to either a .txt or a .csv file while a build is running. These timestamps will be taken when specific commands/strings occur in the Jenkins output. I've given an example below of the timestamp and command being looked for.
"2021-08-17 11:46:38,899 - LOG: Successfully sent the test record"
I'd prefer to send just the timestamp itself, as a lot of the information generated in the console isn't of interest for what I want to do, but if the full line needs to be sent then that would work as well.
My ultimate goal would be to do this for multiple different and unique commands/strings that occur in the Jenkins output. Along with this, some testing I’d be doing would involve running the same script over and over for a set number of loops, so I’d want the timestamp data to be saved into a singular output file (and not overwritten) or in separate output files for each loop.
Any hints or tips for this would be greatly appreciated, as I've reached a dead end in what I can find online. So far I have looked at the Logging function, at using the wait_for_value function to find the required command/string in the console output and then save/print it to a variable, and at whether regex would be suitable for the task.
Thanks in advance for any help on this.
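A regex-based approach along the lines mentioned above might look like the following sketch. It assumes the console output is available as lines of text (for example read from a saved log file or piped in), and that every line of interest follows the format of the example line; the function names and the CSV column layout are hypothetical.

```python
import csv
import re

# Timestamp format taken from the example line:
# "2021-08-17 11:46:38,899 - LOG: Successfully sent the test record"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (.*)$")


def extract_timestamps(lines, markers):
    """Yield (timestamp, marker) for each line whose message contains a marker."""
    for line in lines:
        # Strip whitespace and any surrounding double quotes before matching.
        match = LINE_RE.match(line.strip().strip('"'))
        if not match:
            continue
        timestamp, message = match.groups()
        for marker in markers:
            if marker in message:
                yield timestamp, marker


def append_to_csv(rows, path, loop_index):
    """Append rows to `path` (mode "a"), so repeated loops never overwrite earlier data."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for timestamp, marker in rows:
            writer.writerow([loop_index, timestamp, marker])
```

Passing a list of several marker strings covers the case of multiple unique commands, and the loop_index column keeps the results of repeated runs apart within a single output file.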
I'm trying to write a script which can download the outputs from an Azure ML experiment Run after the fact.
Essentially, I want to know how I can get a Run by its runId property (or some other identifier).
I am aware that I have access to the Run object when I create it for the purposes of training. What I want is a way to recreate this Run object later in a separate script, possibly from a completely different environment.
What I've found so far is a way to get a list of ScriptRun objects from an experiment via the get_runs() function. But I don't see a way to use one of these ScriptRun objects to create a Run object representing the original Run and allowing me to download the outputs.
Any help appreciated.
I agree that this could probably be better documented, but fortunately, it's a simple implementation.
This is how you get a run object for an already submitted run with azureml-sdk>=1.16.0 (for the older approach, see my answer here):
from azureml.core import Workspace
ws = Workspace.from_config()
run = ws.get_run('YOUR_RUN_ID')
Once you have the run object, you can call methods like:
.get_file_names() to see what files are available (the logs in azureml-logs/ and logs/azureml/ will also be listed)
.download_file() to download an individual file
.download_files() to download all files that match a given prefix (or all the files)
See the Run object docs for more details.
I am currently running evaluations with multiple parameter configurations in a medium sized project.
I set certain parameters and change some code parts and run the main file with python.
Since the execution will take several hours, after starting it I make changes to some files (comment out some lines and change parameter) and start it again in a new tmux session.
While doing this, I observed behaviour where the first execution uses configuration options of the second execution, so it seems that Python was not done parsing the code files, or maybe loads them lazily.
Therefore I wonder how Python loads modules / code files, and whether changing them after I have started the execution will affect it.
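For reference, Python reads and compiles a module's source file when its import statement is first executed, then caches the module object in sys.modules; it does not read all project files at startup. A module that is only imported later in the run (for example inside a function) is therefore read from disk at that later moment, which would be consistent with the behaviour described. The sketch below demonstrates this with two hypothetical throwaway modules, cfg_early and cfg_late, written to a temporary directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo():
    """Show that an imported module is cached, while a later import reads the edited file."""
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            early = Path(tmp, "cfg_early.py")  # hypothetical module, imported right away
            late = Path(tmp, "cfg_late.py")    # hypothetical module, imported only later
            early.write_text("VALUE = 'first run'\n")
            late.write_text("VALUE = 'first run'\n")
            importlib.invalidate_caches()

            import cfg_early  # the source is read and executed here, once

            # Edit both files "while the program is running".
            early.write_text("VALUE = 'second run'\n")
            late.write_text("VALUE = 'second run'\n")

            import cfg_early as again  # served from sys.modules; the file is not re-read
            import cfg_late            # first import happens only now, after the edit

            return cfg_early.VALUE, again.VALUE, cfg_late.VALUE
        finally:
            # Undo the changes to the import system made for the demo.
            sys.path.remove(tmp)
            sys.modules.pop("cfg_early", None)
            sys.modules.pop("cfg_late", None)


print(demo())  # ('first run', 'first run', 'second run')
```

Under this reading, one way to avoid the problem would be to run each configuration from its own copy of the code, or to pass the parameters on the command line rather than editing files between runs.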
I want to automate the entire process of creating ngs, bit and mcs files in Xilinx and have these files automatically associated with certain folders in the SVN repository. What I need to know is whether there is a log file, created in the back end of the Xilinx GUI, that records all the commands I run, e.g. open project, load file, synthesize, etc.
The other part that I have not been able to find is a log file that records the entire process of synthesis, map, place and route, and generate programming file, and in particular records any errors that the tool encountered during these processes.
If any of you can point me to such files, if they exist, that would be great. I haven't gotten much out of my search, but maybe I didn't look hard enough.
Thanks!
Well, it is definitely a nice project idea, but a good amount of work. There's always a reason why an IDE was built. A simple search yields the "Command Line Tools User Guide" for various versions of Xilinx ISE; the one for 14.3, for example, has 380 pages about:
Overview and list of features
Input and output files
Command line syntax and options
Report and message information
ISE is a GUI for various command line executables; most of them are located in the subfolder 14.5/ISE_DS/ISE/bin/lin/ (in this case: Linux executables for version 14.5) of your ISE installation root. You can review your current parameters for each action by right-clicking the item in the process tree and selecting "Process properties".
On the Python side, consider using the subprocess module:
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
Is this the entry point you were looking for?
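A minimal sketch of how subprocess could drive such command line tools and collect their output in one log file is shown below. The helper names run_and_log and run_flow are hypothetical, and the actual tool invocations and their options are deliberately not shown here; they should be taken from the Command Line Tools User Guide for your ISE version.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run `command`, append its combined stdout/stderr to `log_path`, return the exit code."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so errors appear in order
        text=True,
    )
    with open(log_path, "a") as log:
        log.write("$ " + " ".join(command) + "\n")  # record the command that was run
        log.write(result.stdout)
    return result.returncode


def run_flow(commands, log_path):
    """Run the commands in order, stopping at the first non-zero exit code."""
    for command in commands:
        code = run_and_log(command, log_path)
        if code != 0:
            return code
    return 0


if __name__ == "__main__":
    # Stand-in commands for illustration only; replace them with the real tool calls.
    steps = [
        [sys.executable, "-c", "print('step 1 finished')"],
        [sys.executable, "-c", "print('step 2 finished')"],
    ]
    print(run_flow(steps, "flow.log"))
```

Each command is given as a list of arguments, which avoids shell quoting problems, and the single log file then contains every command together with its output and any error messages.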
As phineas said, what you are trying to do is quite an undertaking.
I've been there and done that, and there are countless challenges along the way. For example, if you want to move generated files to specific folders, how do you classify these files in order to figure out which files are which? I've created a project called X-MimeTypes that attempts to classify the files, but you then need a tool to parse the EDA mime type database and use that to determine which files are which.
However there is hope, so to answer the two main questions you've pointed out:
To be able to automatically move generated files to predetermined paths: from what you are saying, it seems you want to do this to make the versioning process easier? There is already a tool that does this for you based on "design structures" that you create and that can be shared within a team. The tool is called Scineric Workspace, so check it out. It also has built-in Git and SVN support, which ignores things according to the design structure, and in most cases it filters out everything generated by vendor tools without you having to worry about it.
You are looking for a log file that shows all commands that were run. As phineas said, you can check out the Command Line Tools User Guides for ISE, but be aware that the commands have changed again in Vivado. The log file of each process also usually states the exact command, with its parameters, that was called; this should be close to the top of the report. If you are looking for one log file that contains everything, that does not exist. Again, Scineric Workspace supports invoking flows from major vendors (ISE, Vivado, Quartus), and it produces one log file for all processes together while still allowing each process to create its own log file. Errors, warnings, etc. are also marked properly in this big report. Scineric has a Tcl shell mode as well, so your Python tool can run it in the background and parse the complete log file it creates.
If you have more questions on the above, I will be happy to help.
Hope this helps,
Jaco