awk: fatal: cannot open file 'file' for reading (Permission denied) - python

The piece of code below is part of a larger program which I am running on a remote server via a batch script with #!/bin/bash -l as its first line.
On my local machine it runs normally, but on the remote server permission issues arise. What may be wrong?
The description of the code may not be important to the problem, but basically the code uses awk to process the contents of the files based on the names of the files.
Why is awk denied permission to operate on the files? When I run awk directly at a shell prompt on the remote server, it works normally.
#!/usr/bin/env python
import subprocess

list_of_files = ["file1", "file2", "file3"]
for file in list_of_files:
    awk_cmd = '''awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)) ++i; next} 1' ''' + file + \
              " > tmp && mv tmp " + file + " | cat files > 'pooled_file' "
    exitcode = subprocess.call(awk_cmd, shell=True)
Any help would be appreciated.

I am pretty sure it is a path issue rather than permissions: when you log into the remote machine you are NOT landing in the directory where your Input_file(s) are present; of course it lands in the HOME directory of the logged-in user on the remote server. So it is good practice to give file names with complete paths (make sure the files are actually present at the target location too, or write a wrapper around the command to check whether they are present). Could you please try the following.
#!/usr/bin/env python
import subprocess

list_of_files = ["/full/path/file1", "/full/path/file2", "/full/path/file3"]
for file in list_of_files:
    awk_cmd = '''awk '/^>/{num=split(FILENAME,array,"/");print ">" substr(array[num],1,length(array[num])) ++i; next} 1' ''' + file + \
              " > tmp$$ && mv tmp$$ " + file + " | cat files > 'pooled_file' "
    exitcode = subprocess.call(awk_cmd, shell=True)
I haven't tested it, but I have changed the code to use full paths. Since awk's FILENAME will now contain the complete path, I changed FILENAME in your code to the last element of the split array, and I also changed the tmp temporary file to tmp$$ to be on the safer side.
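To rule out a working-directory problem before suspecting permissions, a small sanity check like the sketch below can help. The paths are placeholders, and the awk invocation is just a trivial read test:

```python
import os
import subprocess

list_of_files = ["/full/path/file1", "/full/path/file2"]  # placeholder paths

# The batch job may start in $HOME rather than the data directory;
# printing the working directory shows where awk is actually run from.
print("cwd:", os.getcwd())

missing = [f for f in list_of_files if not os.path.isfile(f)]
unreadable = [f for f in list_of_files
              if os.path.isfile(f) and not os.access(f, os.R_OK)]
print("missing:", missing)
print("unreadable:", unreadable)

for f in list_of_files:
    if f in missing or f in unreadable:
        continue
    # The file exists and is readable, so awk should be able to open it.
    subprocess.call(["awk", "1", f], stdout=subprocess.DEVNULL)
```

If missing is non-empty on the remote server but empty locally, the problem is the working directory or the paths, not awk's permissions.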

Related

How to move files in a Drive folder from Google Colab?

I'm using this code to read paths from a txt file; the code changes the extension of the paths from jpg to json.
%cd /content
eliminados = 0
with open('vuelo1.txt') as b:
    for o in b:
        o = o.replace("jpg", "json")
        print('path:', o)
        eliminados = eliminados + 1
        !mv $o /content/drive/MyDrive/Banano/etiquetas_eval/Datasets_originales_inferidos/etiquetas/malas/$etiquetas/
print(eliminados)
Then I need to move the json files to another folder for which I use the following line:
!mv $o /content/drive/MyDrive/Banano/etiquetas_eval/Datasets_originales_inferidos/etiquetas/malas/$etiquetas/
where $o is the path of the json file inside the txt file, and the next path is the destination folder.
However I get this error:
mv: missing destination file operand after '/content/drive/MyDrive/Banano/etiquetas_eval/Datasets_originales_inferidos/ric/vuelo1/DJI_0338_2-1.json'
Try 'mv --help' for more information.
/bin/bash: line 1: /content/drive/MyDrive/Banano/etiquetas_eval/Datasets_originales_inferidos/etiquetas/malas/vuelo1/: Is a directory
path: /content/drive/MyDrive/Banano/etiquetas_eval/Datasets_originales_inferidos/ric/vuelo1/DJI_0332_1-2.json
Any idea what I'm doing wrong?
There's a newline at the end of $o, so your !mv $o /content/drive/.. is broken into 2 lines / commands:
mv /content/drive/.../DJI_0338_2-1.json
/content/drive/.../malas/vuelo1/
That's why you see 2 separate error messages.
Try replacing o = o.replace("jpg", "json") with o = o.rstrip().replace("jpg", "json") to strip the trailing newline.
Debugging tip: using something like print(f'path: "{o}"') makes it far easier to spot such issues; and if you are not quite sure what exactly gets sent to Bash or how vars are evaluated, test your commands with echo first:
!echo mv $o /content/drive/.../malas/$etiquetas/
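The effect is easy to reproduce in plain Python: each line read from a text file keeps its trailing newline, which then splits the shell command in two. (The path below is a made-up example.)

```python
# A line read from a text file keeps its trailing newline.
line = "photos/DJI_0338_2-1.jpg\n"  # hypothetical path

broken = line.replace("jpg", "json")          # newline survives the replace
fixed = line.rstrip().replace("jpg", "json")  # rstrip drops it first

print(repr(broken))  # note the trailing \n -- this is what splits the mv command
print(repr(fixed))
```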

Spark: CopyToLocal in Cluster Mode

I have a PySpark script where data is processed and then converted to CSV files. As the end result should be ONE CSV file accessible via WinSCP, I do some additional processing to put the CSV files from the worker nodes together and transfer the result out of HDFS to the FTP server (I think it's called the edge node).
from py4j.java_gateway import java_import
import os
YYMM = date[2:7].replace('-','')
# First, clean out both HDFS and local folder so CSVs do not stack up (data history is stored in DB anyway if update option is enabled)
os.system('hdfs dfs -rm -f -r /hdfs/path/new/*')
os.system('rm -f /ftp/path/new/*')
#timestamp = str(datetime.now()).replace(' ','_').replace(':','-')[0:19]
df.coalesce(1).write.csv('/hdfs/path/new/dataset_temp_' + date, header = "true", sep = "|")
# By default, output CSV has weird name ("part-0000-..."). To give proper name and delete automatically created upper folder, do some more processing
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
sc = spark.sparkContext
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(sc._jvm.Path('/hdfs/path/new/dataset_temp_' + date + '/part*'))[0].getPath().getName()
fs.rename(sc._jvm.Path('/hdfs/path/new/dataset_temp_' + date + "/" + file), sc._jvm.Path('/hdfs/path/new/dataset_' + YYMM + '.csv'))
fs.delete(sc._jvm.Path('/hdfs/path/new/dataset_temp_' + date), True)
# Shift CSV file out of HDFS into "regular" SFTP server environment
os.system('hdfs dfs -copyToLocal hdfs://<server>/hdfs/path/new/dataset_' + YYMM + '.csv' + ' /ftp/path/new')
In client mode everything works fine, but when I switch to cluster mode I get an error that the final /ftp/path/new in the copyToLocal command is not found, I suppose because it is looking on the worker nodes and not on the edge node. Is there any way to overcome this? As an alternative, I thought of doing the final copyToLocal command from a batch script outside of the Spark session, but I'd rather have it all in one script...
Instead of running the OS commands in your Spark script, you can write the output directly to the FTP location. You need to provide the path to the FTP location with the save mode set to overwrite. You can then run the code to rename the data after your Spark script has completed.
YYMM = date[2:7].replace('-', '')
(df.coalesce(1).write
    .mode("overwrite")
    .csv('/ftp/path/new/{0}'.format(date), header="true", sep="|"))
# Run the command below in a separate step once the above code has executed.
os.system("mv /ftp/path/new/{0}/*.csv /ftp/path/new/{0}/dataset_{1}.csv".format(date, YYMM))
I have made an assumption that the FTP location is accessible by the worker nodes, since you are able to run the copyToLocal command in client mode. If the location is not accessible, you will have to write the file out to the HDFS location as before and run the moving and renaming of the file in a separate process/code outside of the Spark script.
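If you prefer not to shell out for the rename, the same step can be done from Python with glob and shutil. A minimal sketch, with placeholder paths and YYMM value:

```python
import glob
import shutil

out_dir = "/ftp/path/new/2021-01-01"  # hypothetical Spark output directory
yymm = "2101"                         # hypothetical YYMM value

# With coalesce(1) Spark writes a single part file; find it and rename it.
parts = glob.glob(out_dir + "/part-*.csv")
if parts:
    shutil.move(parts[0], "{0}/dataset_{1}.csv".format(out_dir, yymm))
```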

No such file error when changing permissions of newly created HTML file in Python

I am running a process that downloads an HTML file using Selenium, Chromedriver, and Ubuntu, and then attempts to change the permissions of that file to 777, but it fails with a "no such file or directory" error.
The thing is this, because I connect to a VPN using openvpn and disconnect several times during the process, I need to run it with root access.
I therefore have this shell script that I run using sudo bash ./nameofscript.sh:
#!/bin/bash
source ~/imageTextAlgorithms/bin/activate
python ~/imageTextAlgorithms/DownloadURLs.py
After connecting to the VPN and downloading the desired file using Selenium, I run the following to save the file and change permissions:
filename = os.path.join(parentdir,"data","HTML",get_random_string(10))
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(filename)
pyautogui.hotkey('enter')
call('sudo chmod 777 ' + filename + ".html", shell=True)
call('sudo chmod -R 777 ' + filename + "_files", shell=True)
Here parentdir is the full absolute path to the ~/imageTextAlgorithms directory where my code is located, and get_random_string(n) generates a random string of n lowercase characters. I use pyautogui in order to make the browser download all images and CSS when saving, as opposed to just saving the HTML source file.
Those calls (the call function is from subprocess) give me a no such file error, but if I make the exact same call from an OS command line it succeeds. I have also already tried the Python function os.chmod, with no success.
Why is this happening? What am I doing wrong?
As suggested by furas in comments, it was a timing issue.
It turns out that there is a wait between pyautogui pressing enter in the save dialog box and the files actually appearing on the filesystem, as Chrome downloads everything again when you save the page.
So I changed the sudo chmod calls to the following, and now it works as expected:
import os
from time import sleep

while True:
    try:
        os.chmod(filename + ".html", 0o777)
        os.chmod(filename + "_files", 0o777)
        for dirpath, dirnames, filenames in os.walk(filename + "_files"):
            for dirname in dirnames:
                path = os.path.join(dirpath, dirname)
                os.chmod(path, 0o777)
            for name in filenames:  # 'name', not 'filename', to avoid shadowing the outer variable
                path = os.path.join(dirpath, name)
                os.chmod(path, 0o777)
        break
    except Exception as e:
        sleep(1)
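A variation that avoids looping forever if the save never completes is to poll for the file with a timeout. A minimal sketch; the helper name and timeout values are my own, not from the original code:

```python
import os
import time

def wait_for_path(path, timeout=60.0, interval=1.0):
    """Poll until `path` exists or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False

# Hypothetical usage after pyautogui triggers the save dialog:
# if wait_for_path(filename + ".html", timeout=120):
#     os.chmod(filename + ".html", 0o777)
```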

Track folder changes / Dropbox changes

First of: I know of pyinotify.
What I want is an upload service to my home server using Dropbox.
I will have a Dropbox shared folder on my home server. Every time someone else who is sharing that folder puts anything into it, I want my home server to wait until it is fully uploaded and then move all the files to another folder, removing them from the Dropbox folder and thus saving Dropbox space.
The thing here is, I can't just track for changes in the folder and move the files right away, because if someone uploads a large file, Dropbox will already start downloading and therefore showing changes in the folder on my home server.
Is there some workaround? Is that somehow possible with the Dropbox API?
Haven't tried it myself, but the Dropbox CLI version seems to have a 'filestatus' method to check for current file status. Will report back when I have tried it myself.
There is a Python dropbox CLI client, as you mentioned in your question. It returns "Idle..." when it isn't actively processing files. The simplest mechanism I can imagine for achieving what you want would be a while loop that checks the output of dropbox.py filestatus /home/directory/to/watch, performs an scp of the contents and then deletes them if that succeeded, and then sleeps for five minutes or so.
Something like:
import time
from subprocess import check_call, check_output

DIR = "/directory/to/watch/"
REMOTE_DIR = "user@my_server.com:/folder"

while True:
    if check_output(["dropbox.py", "status", DIR]) == "\nIdle...":
        # shell=True so the * glob is expanded; check_call returns 0 on success
        if check_call("scp -r " + DIR + "* " + REMOTE_DIR, shell=True) == 0:
            check_call("rm -rf " + DIR + "*", shell=True)
    time.sleep(360)
Of course I would be very careful when testing something like this, put the wrong thing in that second check_call and you could lose your filesystem.
You could run incrond and have it wait for IN_CLOSE_WRITE events in your Dropbox folder. Then it would only be triggered when a file transfer completed.
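An incrontab entry for that could look like the fragment below; the paths and the script name are placeholders. In incron's table syntax, $@ expands to the watched directory and $# to the name of the file that triggered the event:

```shell
# edit with: incrontab -e
/home/user/Dropbox/shared IN_CLOSE_WRITE /home/user/bin/move_file.sh $@/$#
```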
Here is a Ruby version that doesn't wait for Dropbox to be idle and can therefore start moving files while it is still syncing. It also ignores . and .., and it checks the filestatus of each file within a given directory.
Then I would run this script either as a cronjob or in a separate screen.
directory = "path/to/dir"
destination = "location/to/move/to"

Dir.foreach(directory) do |item|
  next if item == '.' or item == '..'
  fileStatus = `~/bin/dropbox.py filestatus #{directory + "/" + item}`
  puts "processing " + item
  if (fileStatus.include? "up to date")
    puts item + " is up to date, starting to move file now."
    # cp command here, e.g.: `cp #{directory + "/" + item} #{destination}`
    # rm command here. You probably want to confirm that the copied files are
    # correct first, by comparing md5 sums or something similar.
  else
    puts item + " is not up to date, moving on to next file."
  end
end
This is the full script I ended up with:
# runs in Ruby 1.8.x (ftools)
require 'ftools'

directory = "path/to/dir"
destination = "location/to/move/to"

Dir.glob(directory + "/**/*") do |item|
  next if item == '.' or item == '..'
  fileStatus = `~/bin/dropbox.py filestatus #{item}`
  puts "processing " + item
  puts "filestatus: " + fileStatus
  if (fileStatus.include? "up to date")
    puts item.split('/', 2)[1] + " is up to date, starting to move file now."
    `cp -r #{item + " " + destination + "/" + item.split('/', 2)[1]}`
    # Remove the file in the Dropbox folder if the current item is not a
    # directory and the copied file is identical.
    if (!File.directory?(item) && File.cmp(item, destination + "/" + item.split('/', 2)[1]))
      puts "remove " + item
      `rm -rf #{item}`
    end
  else
    puts item + " is not up to date, moving to next file."
  end
end

Using Python to execute a command on every file in a folder

I'm trying to create a Python script that would :
Look into the folder "/input"
For each video in that folder, run a mencoder command (to transcode them to something playable on my phone)
Once mencoder has finished his run, delete the original video.
That doesn't seem too hard, but I suck at Python :)
Any ideas on what the script should look like ?
Bonus question : Should I use
os.system
or
subprocess.call
?
Subprocess.call seems to allow for a more readable script, since I can write the command like this :
cmdLine = ['mencoder',
           sourceVideo,
           '-ovc', 'copy',
           '-oac', 'copy',
           '-ss', '00:02:54',
           '-endpos', '00:00:54',
           '-o', destinationVideo]
EDIT : Ok, that works :
import os, subprocess

bitrate = '100'
mencoder = 'C:\\Program Files\\_utilitaires\\MPlayer-1.0rc2\\mencoder.exe'
inputdir = 'C:\\Documents and Settings\\Administrator\\Desktop\\input'
outputdir = 'C:\\Documents and Settings\\Administrator\\Desktop\\output'

for fichier in os.listdir(inputdir):
    print 'fichier :' + fichier
    sourceVideo = inputdir + '\\' + fichier
    destinationVideo = outputdir + '\\' + fichier[:-4] + ".mp4"
    commande = [mencoder,
                '-of', 'lavf',
                [...]
                '-mc', '0',
                sourceVideo,
                '-o', destinationVideo]
    subprocess.call(commande)
    os.remove(sourceVideo)

raw_input('Press Enter to exit')
I've removed the mencoder command, for clarity and because I'm still working on it.
Thanks to everyone for your input.
To find all the filenames, use os.listdir(). Then loop over the filenames, like so:
import os
for filename in os.listdir('dirname'):
    callthecommandhere(blablahbla, filename, foo)
If you prefer subprocess, use subprocess. :-)
Use os.walk to iterate recursively over directory content:
import os
root_dir = '.'
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        print os.path.join(directory, file)
There is no real difference between os.system and subprocess.call here, unless you have to deal with strangely named files (filenames including spaces, quotation marks, and so on). In that case subprocess.call is definitely better, because you don't need to do any shell quoting on file names. os.system is better when you need to accept any valid shell command, e.g. one received from the user in a configuration file.
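A small demonstration of the quoting difference, using printf (which prints each of its arguments on its own line) and a made-up file name containing spaces:

```python
import subprocess

filename = "my home video.avi"  # hypothetical name with spaces

# List form: each element is passed as exactly one argument; no quoting needed.
one_arg = subprocess.check_output(["printf", "%s\n", filename])

# String form with shell=True: without manual quoting, the shell splits on spaces.
split_args = subprocess.check_output("printf '%s\\n' " + filename, shell=True)

print(one_arg)    # the whole name, on one line
print(split_args) # three lines: the name was split into three arguments
```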
The recommended modern way in Python 3 is to use pathlib:
from pathlib import Path

mydir = Path("path/to/my/dir")
for file in mydir.glob('*.mp4'):
    print(file.name)
    # do your stuff
Instead of *.mp4 you can use any filter, even a recursive one like **/*.mp4. If you want to match more than one extension, you can simply iterate over everything with * or **/* (recursive) and check each file's extension with file.name.endswith(('.mp4', '.webp', '.avi', '.wmv', '.mov')).
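Put together, a recursive multi-extension filter might look like this sketch (the directory path is a placeholder):

```python
from pathlib import Path

mydir = Path("path/to/my/dir")  # placeholder directory
extensions = ('.mp4', '.avi', '.mov')

# Walk everything recursively and keep only files with a wanted extension.
videos = [f for f in mydir.glob('**/*') if f.name.endswith(extensions)]
for video in videos:
    print(video)
```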
Python might be overkill for this.
for file in *; do mencoder -some options "$file"; rm -f "$file" ; done
The rm -f "$file" deletes the files.
AVI to MPG (pick your extensions):
import os
from subprocess import Popen, PIPE

files = os.listdir('/input')
for sourceVideo in files:
    if sourceVideo[-4:] != ".avi":
        continue
    destinationVideo = sourceVideo[:-4] + ".mpg"
    cmdLine = ['mencoder', sourceVideo, '-ovc', 'copy', '-oac', 'copy', '-ss',
               '00:02:54', '-endpos', '00:00:54', '-o', destinationVideo]
    output1 = Popen(cmdLine, stdout=PIPE).communicate()[0]
    print output1
    # 'del' is a Windows shell builtin; os.remove(sourceVideo) is more portable
    output2 = Popen('del ' + sourceVideo, stdout=PIPE, shell=True).communicate()[0]
    print output2
Or you could use the os.path.walk function (Python 2 only; it was removed in Python 3), which does more of the work for you than plain os.walk:
A stupid example:
def walk_func(blah_args, dirname, names):
    print ' '.join(('In ', dirname, ', called with ', blah_args))
    for name in names:
        print 'Walked on ' + name

if __name__ == '__main__':
    import os.path
    directory = './'
    arguments = '[args go here]'
    os.path.walk(directory, walk_func, arguments)
I had a similar problem; with a lot of help from the web and this post I made a small application. My target is VCD and SVCD and I don't delete the source, but I reckon it will be fairly easy to adapt to your own needs.
It can convert one video and cut it, or convert all videos in a folder, rename them, and put them in a subfolder /VCD.
I also added a small interface; hope someone else finds it useful!
I put the code and files here, by the way: http://tequilaphp.wordpress.com/2010/08/27/learning-python-making-a-svcd-gui/
