How to track changes in a target directory? [closed] - python

I am working on a data engineering project using Airflow and dbt.
I have a target directory in which I have to track new files as they are added. I first thought about using FileSensor from Airflow, but the documentation doesn't make this use case very obvious.
Implementing an alternative with BashOperator shouldn't be that complex; the idea is as follows:
I have the path to the target directory: path/to/data.
I have a tracking-state file, files.txt, that records the files that have been processed so far; it starts out empty.
The first task, a BashOperator named get_new_files, compares the target directory against files.txt. On the first run files.txt is empty, so all files within the target directory are processed and files.txt is updated with the processed files; on each subsequent run, only newly added files are processed (files.txt is appended on each run with the files already seen):
get_new_files = BashOperator(
    task_id='get_new_files',
    bash_command="""
    current_files=$(ls /path/to/project_dir/data)
    previous_files=$(cat /path/to/project_dir/files.txt)
    new_files=$(comm -13<(echo "$previous_files")<(echo "$current_files"))
    echo "$new_files" > /path/to/project_dir/files.txt""")
I have another task, process_files, set downstream of get_new_files; it can pull the output of the bash task (the new files) from XCom, and then inside the task I can simply do:
import pandas as pd

def process_new_files(*args, **kwargs):
    ti = kwargs['ti']
    new_files = ti.xcom_pull(task_ids='get_new_files')
    for file in new_files.split('\n'):
        df = pd.read_csv(file)
I have two problems running this DAG: first, the comm -13 comparison isn't working, and second, I don't know how to push new_files from the BashOperator to XCom.
I'd be glad if anybody could help, and I'd also appreciate good resources to consult for similar problems apart from Stack Overflow and the Airflow docs, which can sometimes be less than exhaustive about use cases.
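For reference, a minimal corrected sketch of the bash task, assuming Airflow 2.x, where BashOperator pushes the last line written to stdout to XCom when do_xcom_push=True (the default). The paths are kept from the question; the comma-separated output format is an assumption for illustration:

    from airflow.operators.bash import BashOperator

    # Sketch only: comm requires sorted input and spaces around the process
    # substitutions; appending with >> keeps the record of already-seen files.
    get_new_files = BashOperator(
        task_id='get_new_files',
        bash_command="""
        current_files=$(ls /path/to/project_dir/data | sort)
        previous_files=$(sort /path/to/project_dir/files.txt)
        new_files=$(comm -13 <(echo "$previous_files") <(echo "$current_files"))
        echo "$new_files" >> /path/to/project_dir/files.txt
        echo "$new_files" | paste -sd ','  # last stdout line is pushed to XCom
        """,
        do_xcom_push=True,
    )

Under these assumptions, the downstream task would split the pulled string on ',' rather than '\n'.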

Related

Directory conventions for extra files [closed]

I have written a (Python) script which identifies all nodes on its subnet (a NAT subnet) and sends an email if an IP address doesn't exist within a record. For a project of this size, I don't want to run a database; I think a newline-delimited text file with one IP address per line would be suitable. What is the appropriate convention for the directory this file should be placed in?
There are no real conventions; the answers to this question may be a little opinionated, and a lot depends on the script's exact use case.
You have the following choices:
datafile in current working directory
datafile in same directory as script
datafile in a folder (created if nonexistent) relative to the home directory
like 3.) but with an additional profile name
in a system folder like /var/run/toolname
explicit parameter to the script specifies the file name
1.) Current working directory
Con:
It forces you to cd into the directory where you have your data; otherwise the script won't work, or will create new data from scratch.
Pro:
As the same user on the same machine, you can have multiple projects, each using its own data.
2.) datafile in same directory as script
Pro:
You don't have to cd anywhere, and you know where to find the data.
Con:
you need write permissions for the directory where the script is located (a potential security concern if the script is executed by a web server and an exploit is found)
only one data file per script
no two users can run the script with different data
3.) datafile in a folder (created if nonexistent) relative to the home directory
Pro:
script in one directory, but each user has their own data.
if the script directory is removed, or the script is uninstalled and reinstalled, the data persists.
no special write permissions for the script dir are required.
Con:
if you remove the script dir, the data persists and might never be deleted.
one user cannot use the script for multiple projects with separate data.
cleanup of data is required after uninstalling the script.
4.) like 3.) but with an additional profile/project name
Pro:
same as for 3.)
one can run multiple projects/profiles in parallel (Firefox, for example, can be run with profiles, which are all stored in ~/.mozilla/firefox, but each profile creates its own subdirectory, e.g. ~/.mozilla/firefox/xxxx.profilename)
Con:
same as for 3.)
5.) in a system folder like /var/run/toolname
Pro:
could be nice for tools that run as a service / server
Con:
requires admin privileges (at least during installation)
6.) explicit parameter to the script specifies the file name
Pro:
you can do whatever you want
Con:
you must specify an additional parameter (unless you make one of the other behaviours the default and use the option only when you want different behaviour)
All of these solutions should be fine for you.
For your kind of tool, 5.), 3.), 2.) or 6.) are probably the better options; I assume your script will be run as a cron job or something similar.
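As an illustration of options 3.) and 4.), a minimal sketch; the tool name .mytool and the file name known_ips.txt are hypothetical:

    from pathlib import Path

    def data_file(profile: str = "default") -> Path:
        # per-user data directory under the home directory, created on demand
        base = Path.home() / ".mytool" / profile
        base.mkdir(parents=True, exist_ok=True)
        return base / "known_ips.txt"

    print(data_file())          # e.g. /home/alice/.mytool/default/known_ips.txt
    print(data_file("office"))  # a second project/profile, isolated from the first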

What is the best practice for working with files and paths? [closed]

I am creating a Python script that essentially takes a 'path to a file' as an argument from the user. It does some post-processing and creates a new file in the same directory as the original file.
For example: myscript.py C:\\A\\sub_A\\work_on_this_file.csv
I'm using the path I received to create a file handle for C:\\A\\sub_A\\final_file.csv.
I've been told to use os.chdir() to navigate to the folder instead and create my final file there instead of using paths directly. What is the best practice in such a scenario? Is there a risk in not changing the working directory?
I would encourage you to always use absolute paths; in practice that's the most straightforward way. So directly creating a file (or opening an existing one, doesn't matter) using the absolute path is fine.
When you are not sure whether you will get an absolute or a relative path, I would suggest taking the script's directory as the base folder and then generating an absolute path, like so:
import os

# directory containing this script, as an absolute path
cwd = os.path.abspath(os.path.dirname(__file__))
given_path = "../../myfile.csv"
# resolve the (possibly relative) given path against the script's directory
csv_path = os.path.abspath(os.path.join(cwd, given_path))
Use sys.argv[0] instead of __file__ when dealing with modules/imported scripts. IMO, changing the CWD is normally not required and will most likely break other things sooner or later.
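A pathlib variant of the same idea, sketching the asker's scenario (final_file.csv as named in the question); the output lands next to the input without any chdir:

    import sys
    from pathlib import Path

    # invoked as: myscript.py C:\A\sub_A\work_on_this_file.csv
    input_path = Path(sys.argv[1]).resolve()              # absolute path to the input
    output_path = input_path.with_name("final_file.csv")  # same directory, new name

    with output_path.open("w") as f:
        f.write("processed output goes here\n")  # placeholder for the real post-processing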
My five cents:
Use CSV File Reading and Writing (the csv module) to work with CSV files.
Use the with statement when opening files, so the file is always closed even if an unexpected error occurs.
Always use os.sep in order to have cross-platform paths.
Use os.path.join to form file paths correctly.
Use os.linesep when possible, to process files line by line correctly.
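A small sketch pulling those points together (the path is the one from the question, rebuilt portably):

    import csv
    import os

    # os.path.join and os.sep build the path portably
    path = os.path.join("C:" + os.sep, "A", "sub_A", "work_on_this_file.csv")
    with open(path, newline="") as f:      # "with" closes the file even on error
        for row in csv.reader(f):          # csv module handles quoting/delimiters
            print(row)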

Python or Powershell - Import Folder Name & Text file Content in Folder into Excel [closed]

I have been looking at some Python modules and PowerShell capabilities to try to import some data a database recently kicked out in the form of folders and text files.
File Structure:
Top Level Folder > FOLDER (device hostname) > text file (which also contains the hostname of the device, and holds the data I need in a single cell in Excel)
The end result I am trying to accomplish is to have the first cell contain the FOLDER (device name) and the second column contain the text of the text file within that folder.
I found some Python modules, but they all focus on pulling directly from a text doc. I want the script or PowerShell function to iterate through each folder and pull out both the folder name and the text.
This is definitely doable in PowerShell. If I understand your question correctly, you're going to want to use Get-ChildItem and Get-Content, with -Recurse if necessary. For the export you're going to want Out-File, which can be a hassle when exporting directly to xlsx. If you had some code to work with I could help better, but until then this should get you started in the right direction. I would read up on the Get- commands, because PowerShell is very simple to write but powerful.
Well, since you're asking in a general sense: you can do this project simply in either scripting language. If it were me, and I had to do this job once, I'd probably just bang out a PoSH script to do it. If the script had to be run repeatedly, and could potentially end up with more complex functions, I'd probably switch to Python (but that's just my personal preference, as PoSH is pretty powerful in a Windows env). Really, this seems like a 15-line recursive function in either language.
You can look up "recursive function in powershell" and the same for Python and get plenty of code examples, as file tree walks (FTW :) ) are one of the most solved problems on sites like this. You'd just replace whatever the other person does in their walk with a read of the file and a write. You probably also want to output in CSV format, because it's easier and imports into Excel just fine.
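In that spirit, a minimal Python sketch; the top-level path and output file name are placeholders, and it assumes one level of device folders, each containing .txt files:

    import csv
    from pathlib import Path

    top = Path("C:/TopLevelFolder")  # placeholder root directory
    with open("devices.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["hostname", "contents"])
        for folder in sorted(p for p in top.iterdir() if p.is_dir()):
            for txt in folder.glob("*.txt"):
                # one row per device: folder name, then the full file text
                writer.writerow([folder.name, txt.read_text(encoding="utf-8")])

The resulting CSV opens in Excel with the device name in the first column and the file contents in the second.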

Auto running a script in python [closed]

I'm looking for a script that auto-spams pictures of Nicolas Cage on the desktop.
I have this script right now, but what I want is for it to run automatically as soon as the USB drive is plugged in:
import shutil

src = 'Kim.jpg'
dst = 'H:/profile/desktop/Nic'
count = 1
while count < 10:
    shutil.copyfile(src, dst + str(count) + ".jpg")
    count += 1
Assumption: you are using Windows.
On almost every recent Windows platform you need to approve the autorun of any executable, so no leads there. But here is an alternative using some social/psychological engineering:
Convert your program to a standalone executable using a tool like py2exe. I assume that after conversion your file is named spammer.exe. Then paste this file onto the USB drive and super-hide it by opening a Command Prompt inside the USB drive and typing:
attrib +h +s +r spammer.exe
Now create a shortcut with the icon of a typical Windows folder, name it something attractive (if you know what I mean), and point it at spammer.exe. The user (in excitement) clicks it, and kaboom!
Assuming that you're on a version of Windows that isn't too old and you have physical access to the machine in question, you should be able to follow these instructions:
http://windows.microsoft.com/en-us/windows/run-program-automatically-windows-starts#1TC=windows-7
Create a .bat file that executes your script. You could even have the script sleep for a random (or semi-random) period of time before going off.
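For the random delay, a short sketch that could sit at the top of the script:

    import random
    import time

    # sleep for a random interval (here up to five minutes) before doing anything
    time.sleep(random.uniform(0, 300))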

Takes an "eternity" to run my Python script [closed]

I have a Python script which loads binary data from any targeted file and stores it inside itself, in a list. The problem is that the bigger the stored file is, the longer the script takes to open the next time. Let's say I want to load a 700 MB movie and store it in my script file, then open the script the next day with the 700 MB of data stored in it. It takes an eternity to open!
Here is a simplified layout of how the script file looks.
Line 1: "The 700 MB movie is stored here inside a list."
Everything below: "All functions that the end user uses."
Before the interpreter reaches the functions the user actually wants to call, it first has to interpret the 700 MB of data on line 1! This is a problem, because who wants to wait an hour just to open a script?
So, would it help if I changed the layout of the file like this?
First lines: "All functions that the end user uses." Below: "The 700 MB movie is stored here inside a list."
Would that help, or would the interpreter have to plow through all 700 MB before the functions were called anyway?
The way the Python compiler works makes what you are trying to do very, very hard, to say the least.
First, every time you change the script (by embedding the file, for example), it triggers a new compilation cycle before execution (turning the .py file into a .pyc one).
Second, every time you import the module, that large block of data is loaded into memory (either on import or when you first access the data).
This is not just slow; it's also unsafe and error-prone.
I'm guessing that what you intend to do is distribute one single file with the data in it.
You might be able to do that using this little trick:
make an executable Python package (basically a zip file).
Building the zip file is very easy using the zipfile module.
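A minimal sketch of that trick; the names app.pyz, __main__.py and movie.bin are placeholders. The point is that the data travels inside the archive as a plain file rather than being parsed as 700 MB of Python source:

    import zipfile

    # build an executable zip: "python app.pyz" runs the archive's __main__.py
    with zipfile.ZipFile("app.pyz", "w") as zf:
        zf.write("__main__.py")  # entry point
        zf.write("movie.bin")    # raw binary data, stored as a file, not as code

    # inside __main__.py the data can be read back from the running archive:
    #     import sys, zipfile
    #     data = zipfile.ZipFile(sys.argv[0]).read("movie.bin")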
