I have a program in Python (Python 3 on Ubuntu 16.04) that checks for new files in a directory (the .mp4 files are the result of segmenting a live video stream). I use os.listdir(path) to get the new files on each iteration. The problem is that when a new .mp4 file is created, an empty file is created first and the contents are appended incrementally, so the file is not yet finalized/finished/playable (if you look at the folder, these in-progress files usually show up with no extension).
Is it possible to ignore such non-finalized files at the Python level when getting the list of files in the directory? Maybe some function or API exists for that?
Using glob.glob('*.mp4', root_dir=path) should be just fine: since the in-progress files don't carry the .mp4 extension yet, a pattern match on *.mp4 returns only the finalized ones. (Note that the root_dir keyword requires Python 3.10+; on older versions, join the pattern onto the path instead.)
https://docs.python.org/3/library/glob.html
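A minimal sketch of both variants, assuming (per the question) that the segmenter only gives a file its final .mp4 name once it is complete:

import glob
import os

# Python 3.10+: root_dir keeps the returned names relative to path.
finished = glob.glob('*.mp4', root_dir=path)

# Older Pythons (Ubuntu 16.04 ships 3.5): match on the joined path,
# then strip the directory part to get bare file names again.
finished = [os.path.basename(p)
            for p in glob.glob(os.path.join(path, '*.mp4'))]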
I am wondering if there is an easy way to access 'parallel' directories (see the directory tree below for what I am talking about... I don't know what else to call them, so please correct me if they have a proper name!) from a Python file without having to hard-code the path string.
The basic structure I intend to use is shown below. The structure will be used across different computers, so I need to avoid typing in "C:\stuff_to_get_there\parent_directory\data\file.txt", because "C:\stuff_to_get_there" will not be the same on different computers.
I want to store the .py files in their own directory, then access the data files in the data directory and save figures to the figures directory. I was thinking of trying the os module, but I'm not sure if that's the correct way to go.
parent directory
    scripts
        .py files
    figures
        save files here
    data
        .txt files stored here
Thanks for any help!
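A minimal sketch of the os-module approach, assuming the running .py file lives in the scripts directory of the tree above (file.txt is the example name from the question):

import os

# Directory containing this script, resolved at run time, so it works
# no matter where the project is copied on a given computer.
script_dir = os.path.dirname(os.path.abspath(__file__))

# The parallel directories are siblings of scripts under the parent.
parent_dir = os.path.dirname(script_dir)
data_dir = os.path.join(parent_dir, 'data')
figures_dir = os.path.join(parent_dir, 'figures')

data_file = os.path.join(data_dir, 'file.txt')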
I'm trying to take an image out of a KMZ file. I currently have the local path of the KMZ and the path of the image inside its respective KML file (relative, not absolute), and what I need is a path I can use to load the file into a database. Is there a way to get it using basic string-type paths?
A KMZ is just a Zip archive (with a .kmz extension instead of .zip), so you should be able to unzip it and access all of the files with the zipfile module or similar.
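A minimal sketch with the zipfile module, assuming placeholder names for the KMZ path and the image path stored in the KML (relative to the archive root):

import zipfile

kmz_path = 'overlay.kmz'        # local path of the KMZ (placeholder)
inner_path = 'files/image.png'  # relative path from the KML (placeholder)

with zipfile.ZipFile(kmz_path) as kmz:
    # extract() writes the file to disk and returns the path it created,
    # which can then be handed to the database loader...
    extracted_path = kmz.extract(inner_path)
    # ...or read the raw bytes directly without touching disk.
    image_bytes = kmz.read(inner_path)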
I have a dataframe that I wrote to a CSV using the code below:
df.write.format("csv").save(base_path+"avg.csv")
Since I am running Spark in client mode, the snippet above created a folder named avg.csv on my worker node; the folder contains part-*.csv files (or nested folders that in turn contain part-*.csv files).
Now, when I try to read avg.csv back, I get a "path doesn't exist" error:
spark.read.format("com.databricks.spark.csv").load(base_path+"avg.csv")
Can anybody tell me where I am going wrong?
The part-00* files are the output of a distributed computation (e.g., MapReduce or Spark). Saving will therefore always create a folder of part files rather than a single file, because each worker writes its own partition; keep this in mind.
So, try using:
spark.read.format("com.databricks.spark.csv").load(base_path+"avg.csv/*")
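As an aside: on Spark 2.0+ the CSV source is built in, and the reader accepts the part-file folder directly; a minimal sketch, assuming a SparkSession named spark:

df = spark.read.csv(base_path + "avg.csv")  # reads every part-*.csv under the folder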
I want to make a little Python script to check and compare the contents of two folders and all the files inside:
Cycle through folder structure based on Folder A
Compare every file from Folder A with Folder B
If the file doesn't exist in Folder B, or the contents are NOT 100% identical, then COPY the file to Folder C, preserving the same folder structure as Folder A
Could anyone advise on how to accomplish this?
I believe dircmp from filecmp does most of that for you:
https://docs.python.org/2/library/filecmp.html
You can just extend the basic example on that page. Using the attributes left_only, right_only and diff_files, you can easily identify missing and not-100%-identical files.
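A minimal sketch along those lines, assuming placeholder folder names folder_a, folder_b and folder_c. One caveat: diff_files compares os.stat() signatures by default, not bytes, so for a strict 100% content check you would recompare candidates with filecmp.cmp(..., shallow=False).

import filecmp
import os
import shutil

def copy_changed(dir_a, dir_b, dir_c):
    cmp = filecmp.dircmp(dir_a, dir_b)
    # Entries only in A, plus files in both whose contents differ.
    for name in cmp.left_only + cmp.diff_files:
        src = os.path.join(dir_a, name)
        dst = os.path.join(dir_c, name)
        if os.path.isdir(src):
            shutil.copytree(src, dst)       # copy whole missing subtree
        else:
            os.makedirs(dir_c, exist_ok=True)
            shutil.copy2(src, dst)
    # Recurse into folders present on both sides, mirroring A's structure.
    for sub in cmp.common_dirs:
        copy_changed(os.path.join(dir_a, sub),
                     os.path.join(dir_b, sub),
                     os.path.join(dir_c, sub))

copy_changed('folder_a', 'folder_b', 'folder_c')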
I have a script that I use to push files back to my home PC using rsync. File names that are successfully pushed are added to an SQLite database so they don't get pushed again (since I only want a one-way mirror). Anyhow, the problem I have is that although the script recursively walks the source path and pushes files matching a defined extension, the files all end up in the same destination root directory.
What I am trying to do is have the destination folder structure mirror the source.
I think I have to add something to the destDir path, but I'm not exactly sure what:
for root, dirs, files in os.walk(sourceDir):
    for file in files:
        # ... if some filtering criteria ...
        print("Syncing new file: " + file)
        cmd = ["rsync"]
        cmd.append(os.path.join(root, file))
        cmd.append(destDir + "/")
        p = subprocess.Popen(cmd, shell=False)
        if p.wait() == 0:
            rememberFile(file)
I think you should rely on the features of rsync for this as much as possible, rather than trying to reimplement it in Python. rsync has been extensively tested and is full-featured; its maintainers have already fixed the kinds of bugs you're running into. For instance, in your original code snippet, you need to reconstruct the file's path relative to sourceDir (instead of using just the filename) and append it to destDir, as in the sketch below.
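A minimal sketch of that fix, reusing sourceDir, destDir and rememberFile from your snippet (note: the mirrored directories must already exist on the destination, or you can let rsync's --relative/-R flag create the implied directories instead):

import os
import subprocess

for root, dirs, files in os.walk(sourceDir):
    for file in files:
        # Directory of this file relative to sourceDir...
        rel_dir = os.path.relpath(root, sourceDir)
        # ...mirrored under destDir so the structure is preserved.
        cmd = ["rsync",
               os.path.join(root, file),
               os.path.join(destDir, rel_dir) + "/"]
        p = subprocess.Popen(cmd, shell=False)
        if p.wait() == 0:
            rememberFile(file)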
But before you keep debugging that, consider this alternative: instead of an SQLite database, why not keep the list of files you have already pushed in a plain text file? Say it's called exclude_list.txt. Then your rsync command becomes a one-liner:
rsync -r --exclude-from 'exclude_list.txt' src dst
The -r switch will cause it to traverse the file tree automatically. See topic #6 on this page for more details on this syntax.
Now you only need your Python script to maintain exclude_list.txt. I can think of two options:
Capture the output of rsync with the -v option to list the filenames that were transferred, parse them, and append them to exclude_list.txt. I think this is the most elegant solution; you can probably do it in just a few lines (see the sketch after this list).
Use the script you already have to traverse the tree and add all of the files to exclude_list.txt, but remove all of the individual rsync calls. Then call rsync once at the end, as above.
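A minimal sketch of option 1, assuming that with -v each transferred file is printed on its own line (the header and summary lines are filtered out heuristically, so double-check against your rsync version's output):

import subprocess

result = subprocess.run(
    ["rsync", "-rv", "--exclude-from", "exclude_list.txt", src, dst],
    capture_output=True, text=True)

with open("exclude_list.txt", "a") as f:
    for line in result.stdout.splitlines():
        # Keep only the per-file lines: skip blanks, directory entries
        # ending in "/", and rsync's header and summary lines.
        if (line
                and not line.endswith("/")
                and not line.startswith(("sending", "sent ", "total "))):
            f.write(line + "\n")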