I am trying to unzip fasta.gz files so that I can work with them. I wrote a script that builds a cmd list, based on something I have done before, but I cannot get the new function to work. See below:
import glob
import sys
import os
import argparse
import subprocess
import gzip
#import gunzip
def decompressed_files():
    print ('starting decompressed_files')
    #files where the data is stored
    input_folder=('/home/me/me_files/PB_assemblies_for_me')
    #where I want my data to be
    output_folder=input_folder + '/fasta_files'
    if os.path.exists(output_folder):
        print ('folder already exists')
    else:
        os.makedirs(output_folder)
        print ('folder has been created')
    for f in input_folder:
        fasta=glob.glob(input_folder + '/*.fasta.gz')
        #print (fasta[0])
        #sys.exit()
        cmd =['gunzip', '-k', fasta, output_folder]
        my_file=subprocess.Popen(cmd)
        my_file.wait
decompressed_files()
print ('The programme has finished doing its job')
But this gives the following error:
TypeError: execv() arg 2 must contain only strings
If I write 'fasta' instead, the programme looks for a file and the error becomes:
fasta.gz: No such file or directory
If I go to the directory where I have the files and type gunzip name_file_fasta_gz, it does the job beautifully, but I have several files in the folder and I would like to loop over them. I have used 'cmd' before, as you can see in the code below, and I didn't have any problem with it. This is the earlier code, where I was able to mix strings and non-string values:
cmd=['velveth', output, '59', '-fastq.gz', '-shortPaired', fastqs[0], fastqs[1]]
#print cmd
my_file=subprocess.Popen(cmd)#I got this from the documentation.
my_file.wait()
I would be happy to learn other ways to run Linux commands from within a Python function. The code is for Python 2.7; I know it is old, but it is the version installed on the server at work.
fasta is a list returned by glob.glob().
Hence cmd = ['gunzip', '-k', fasta, output_folder] generates a nested list:
['gunzip', '-k', ['foo.fasta.gz', 'bar.fasta.gz'], output_folder]
but execv() expects a flat list:
['gunzip', '-k', 'foo.fasta.gz', 'bar.fasta.gz', output_folder]
You can use the list concatenation operator + to create a flat list:
cmd = ['gunzip', '-k'] + fasta + [output_folder]
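The difference between the nested and the flat list can be checked without running gunzip at all. The sketch below is only an illustration: build_gunzip_cmd is a hypothetical helper name, and the output folder is left out on the assumption that gunzip treats every non-option argument as a file to decompress rather than as a destination.

```python
def build_gunzip_cmd(files):
    """Return a flat argument list for gunzip; every element must be a str."""
    cmd = ['gunzip', '-k'] + list(files)
    # Popen() rejects the list if any element is not a string
    assert all(isinstance(arg, str) for arg in cmd)
    return cmd


# What the question's code builds: the third element is itself a list
nested = ['gunzip', '-k', ['foo.fasta.gz', 'bar.fasta.gz']]

# What Popen() needs: one string per argument
flat = build_gunzip_cmd(['foo.fasta.gz', 'bar.fasta.gz'])

print(nested)
print(flat)
```

The flat list can then be handed to subprocess.Popen() in the same way as the velveth command from the question.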
I haven't tested this, but it might solve your unzip problem using the command.
The command gunzip -k keeps both the compressed and the decompressed file, so what is the purpose of the output directory?
import os
import subprocess
import gzip
def decompressed_files():
    print('starting decompressed_files')
    # files where the data is stored
    input_folder=('input')
    # where I want my data to be
    output_folder = input_folder + '/output'
    if os.path.exists(output_folder):
        print('folder already exists')
    else:
        os.makedirs(output_folder)
        print('folder has been created')
    for f in os.listdir(input_folder):
        if f and f.endswith('.gz'):
            # os.listdir() returns bare names, so prepend the folder
            cmd = ['gunzip', '-k', os.path.join(input_folder, f), output_folder]
            my_file = subprocess.Popen(cmd)
            my_file.wait()
print(cmd) will look as shown below
['gunzip', '-k', 'input/sample.gz', 'input/output']
I have a few files in the folder and I would like to create the loop
From the quote above, your actual problem seems to be unzipping multiple *.gz files from a path.
In that case the code below should solve your problem.
import gzip
import os
import shutil
import fnmatch
def gunzip(file_path,output_path):
    with gzip.open(file_path,"rb") as f_in, open(output_path,"wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
def make_sure_path_exists(path):
    try:
        os.makedirs(path)
    except OSError:
        if not os.path.isdir(path):
            raise
def recurse_and_gunzip(input_path):
    walker = os.walk(input_path)
    output_path = 'files/output'
    make_sure_path_exists(output_path)
    for root, dirs, files in walker:
        for f in files:
            if fnmatch.fnmatch(f,"*.gz"):
                gunzip(root + '/' + f, output_path + '/' + f.replace(".gz",""))
recurse_and_gunzip('files')
EDIT:
Using command line arguments -
subprocess.Popen(base_cmd + args) :
Execute a child program in a new process. On Unix, the class uses os.execvp()-like behavior to execute the child program
fasta.gz: No such file or directory
So every extra element in the cmd list is passed to gunzip as an argument. gunzip treats each one as a file name and also tries it with a .gz suffix, hence the error that fasta.gz was not found.
Now, if you want to pass the gz files as command line arguments, you can still do that with the code below (you might need to polish it a little for your needs):
import argparse
import subprocess
import os
def write_to_desired_location(stdout_data,output_path):
    print("Going to write to path", output_path)
    with open(output_path, "wb") as f_out:
        f_out.write(stdout_data)
def decompress_files(gz_files):
    base_path=('files') # my base path
    output_path = base_path + '/output' # output path
    if os.path.exists(output_path):
        print('folder already exists')
    else:
        os.makedirs(output_path)
        print('folder has been created')
    for f in gz_files:
        if f and f.endswith('.gz'):
            print('starting decompressed_files', f)
            proc = subprocess.Popen(['gunzip', '-dc', f], stdout=subprocess.PIPE) # d:decompress and c:stdout
            write_to_desired_location(proc.stdout.read(), output_path + '/' + f.replace(".gz", ""))
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-gzfilelist",
        required=True,
        nargs="+", # 1 or more arguments
        type=str,
        help='Provide gz files as arguments separated by space Ex: -gzfilelist test1.txt.tar.gz test2.txt.tar.gz'
    )
    args = parser.parse_args()
    my_list = [str(item) for item in args.gzfilelist] # args.gzfilelist is already a list; str() only guards the item type
    decompress_files(gz_files=my_list)
execution:
python unzip_file.py -gzfilelist test.txt.tar.gz
output
folder already exists
('starting decompressed_files', 'test.txt.tar.gz')
('Going to write to path', 'files/output/test.txt.tar')
You can pass multiple gz files as well for example
python unzip_file.py -gzfilelist test1.txt.tar.gz test2.txt.tar.gz test3.txt.tar.gz
I have this python script which will take these three arguments:
a given path for a directory with files to rename
a CSV file with two columns to map the file names to:
original,new
barcode01,sample01
barcode02,sample02
extension of the file (i.e. .txt, .bam, .png, .txt.readdb.log) which can be long.
The script:
import os
import csv
import sys
def rename_files(path, name_map, ext):
    with open(name_map, 'r') as csv_map:
        filereader = csv.DictReader(csv_map)
        for row in filereader:
            original_name = row["original"]
            new_name = row["new"]
            old_filename = '%s/%s.%s' % (path, original_name, ext)
            new_filename = '%s/%s_%s.%s' % (path, new_name, original_name, ext)
            try:
                os.rename(old_filename, new_filename)
            except Exception as e:
                print('Rename for file %s failed. Details: ' % old_filename)
                print (e)
if __name__ == '__main__':
    filename, path, name_map, ext = sys.argv
    rename_files(path, name_map, ext)
For example:
python rename.py /test/directory filestorename.csv txt
will only rename barcode01.txt to sample01.txt.
However, there are multiple barcode01 files with different extensions (i.e. barcode01.png). Instead of passing these extensions as arguments to the script, how can I modify this script to just rename all these files at once, keeping the extension the same?
Assuming all files exist, you can extract the base directory, basename and file extension as follows:
from csv import DictReader
from os import path, rename
from sys import exit
import argparse
def rename_file(row):
    origin = row['original']
    directory = path.dirname(origin)
    _, extension = path.splitext(path.basename(origin))
    target = path.join(directory, '{}{}'.format(row['new'], extension))
    return rename(origin, target)
Call it inside a loop:
def rename_files(spreadsheet):
    csv = DictReader(open(spreadsheet))
    valid_rows = filter(lambda row: path.isfile(row['original']), csv)
    for row in valid_rows:
        rename_file(row)
You may also improve your main function:
def main():
    parser = argparse.ArgumentParser('rename files from *.csv')
    parser.add_argument(
        '-f', '--file',
        metavar='file',
        type=str,
        help='csv (comma-separated values) file'
    )
    args = parser.parse_args()
    if not path.isfile(args.file):
        print('No such file: {}'.format(args.file))
        return exit(1)
    return rename_files(args.file)
if __name__ == '__main__':
    main()
I would use the pathlib library, which makes dealing with file names easier. Note that in pathlib, a file name without its extension is called a stem.
#!/usr/bin/env python3
import csv
import pathlib
import sys
def rename_files(directory, name_map):
    directory = pathlib.Path(directory)
    with open(name_map) as stream:
        reader = csv.reader(stream)
        next(reader) # Skip the header
        for old_name, new_name in reader:
            for old_path in directory.glob(old_name + ".*"):
                new_path = old_path.with_stem(new_name)
                old_path.rename(new_path)
if __name__ == '__main__':
    rename_files(sys.argv[1], sys.argv[2])
Example:
python rename.py /test/directory filestorename.csv
Notes
The key to renaming files regardless of extension is to use the .glob() method to find all files with the same name but different extensions.
The .with_stem() method takes a path and returns another path with a different stem (the filename minus its last extension). It requires Python 3.9 or later.
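One case worth checking is the multi-part extension mentioned in the question (.txt.readdb.log). pathlib treats only the last suffix as the extension, so the stem of barcode01.txt.readdb.log is barcode01.txt.readdb and replacing the stem would drop the middle parts. The sketch below is one possible variant, not the answer's method: rename_keeping_suffixes is a hypothetical helper, and it assumes the old name contains no glob wildcard characters.

```python
import pathlib
import tempfile


def rename_keeping_suffixes(directory, old_name, new_name):
    """Rename old_name.* to new_name.*, keeping everything after old_name."""
    directory = pathlib.Path(directory)
    renamed = []
    # sorted() materialises the matches before any file is renamed
    for old_path in sorted(directory.glob(old_name + ".*")):
        # Everything after the original base name, e.g. ".txt.readdb.log"
        tail = old_path.name[len(old_name):]
        new_path = old_path.with_name(new_name + tail)
        old_path.rename(new_path)
        renamed.append(new_path.name)
    return renamed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("barcode01.txt", "barcode01.png", "barcode01.txt.readdb.log"):
            (pathlib.Path(tmp) / name).touch()
        print(rename_keeping_suffixes(tmp, "barcode01", "sample01"))
```

Because it uses with_name() instead of with_stem(), this variant should also run on Python versions older than 3.9.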
I am using a Python script to batch convert many images in different folders into single pdfs (with https://pypi.org/project/img2pdf/):
import os
import subprocess
import img2pdf
from shutil import copyfile
def main():
    folders = [name for name in os.listdir(".") if os.path.isdir(name)]
    for f in folders:
        files = [f for f in os.listdir(f)]
        p = ""
        for ffile in files:
            p += f+'\\' + ffile + " "
        os.system("py -m img2pdf *.pn* " + p + " --output " + f + "\combined.pdf")
if __name__ == '__main__':
    main()
However, despite running the command via Powershell on Windows 10, and despite using very short filenames, when the number of images is very high (eg over 600 or so), Powershell gives me the error "The command line is too long" and it does not create the pdf. I know there is a command-line string limitation (https://learn.microsoft.com/en-us/troubleshoot/windows-client/shell-experience/command-line-string-limitation), but I also know that for powershell this limit is higher (Powershell to avoid cmd 8191 character limit), and I can't figure out how to fix the script. I would like to ask you if you could help me fix the script to avoid violating the character limit. Thank you
PS: I use the script after inserting it in the parent folder that contains the folders with the images; then in each subfolder the output pdf file is created.
By calling the img2pdf library directly instead of going through the command line, you can use this script:
import img2pdf
import os
for r, _, f in os.walk("."):
    imgs = []
    for fname in f:
        if fname.endswith(".jpg") or fname.endswith(".png"):
            imgs.append(os.path.join(r, fname))
    if len(imgs) > 0:
        # os.path.join avoids relying on the "\o" backslash sequence
        with open(os.path.join(r, "output.pdf"), "wb") as out:
            out.write(img2pdf.convert(imgs))
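The order in which os.walk() lists files is not guaranteed, so the page order of the PDF may vary between runs. A minimal sketch of one way to make it deterministic is shown below. The names collect_images, write_pdfs and IMAGE_EXTENSIONS are hypothetical, the extension list is an assumption, and plain alphabetical sorting is assumed to be the wanted page order.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of extensions


def collect_images(root="."):
    """Map each folder under root to its sorted list of image paths."""
    images_by_folder = {}
    for folder, _, names in os.walk(root):
        images = sorted(
            os.path.join(folder, name)
            for name in names
            # lower() so that "scan.PNG" is matched as well
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
        if images:
            images_by_folder[folder] = images
    return images_by_folder


def write_pdfs(root="."):
    """Write one combined.pdf per folder that contains images."""
    import img2pdf  # third-party; imported here so collect_images works without it
    for folder, images in collect_images(root).items():
        with open(os.path.join(folder, "combined.pdf"), "wb") as out:
            out.write(img2pdf.convert(images))
```

Since the file names are passed as a Python list and never form a command line, the command-line length limit does not come into play, however many images a folder holds.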
I'm in the process of writing a Python script that takes two arguments and writes the contents of a folder to a text file for me to use in another process. The snippet I have is below:
#!/usr/bin/python
import cv2
import numpy as np
import random
import sys
import os
import fileinput
#Variables:
img_path= str(sys.argv[1])
file_path = str(sys.argv[2])
print img_path
print file_path
cmd = 'find ' + img_path + '/*.png | sed -e "s/^/\"/g;s/$/\"/g" >' + file_path + '/desc.txt'
print "command: ", cmd
#Generate desc.txt file:
os.system(cmd)
When I try and run that from my command line, I get the following output, and I have no idea how to fix it.
sh: 1: s/$//g: not found
I tested the command I am using by running the following command in a fresh terminal instance, and it works out fine:
images/*.png | sed -e "s/^/\"/g;s/$/\"/g" > desc.txt
Can anyone see why my snippet isn't working? When I run it, I get an empty file...
Thanks in advance!
Python is not sending the full text of your sed expression through to the shell because of how it processes escape sequences in string literals. The quickest solution is to escape the backslashes in the string yourself, because Python currently treats \" as an escape sequence and drops the backslash. So change this line:
cmd = 'find ' + img_path + '/*.png | sed -e "s/^/\"/g;s/$/\"/g" >' + file_path + '/desc.txt'
to this:
cmd = 'find ' + img_path + '/*.png | sed -e "s/^/\\"/g;s/$/\\"/g" >' + file_path + '/desc.txt'
and that should work for you.
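The effect of the escaping can be seen by comparing the strings directly, without calling the shell. This is a small sketch for illustration only: build_sed_command is a hypothetical helper that rebuilds the command from the question, and the raw-string form is shown as an alternative spelling, not something the answer proposes.

```python
plain = 's/^/\"/g'      # Python turns \" into a bare ", so the backslash is lost
escaped = 's/^/\\"/g'   # \\ is a literal backslash, so the shell still sees \"
raw = r's/^/\"/g'       # raw string: the backslash is kept as written


def build_sed_command(img_path, file_path):
    """Build the find | sed pipeline with the backslashes preserved."""
    return ('find ' + img_path + '/*.png | sed -e "s/^/\\"/g;s/$/\\"/g" >'
            + file_path + '/desc.txt')


print(plain)
print(escaped)
print(build_sed_command('images', 'out'))
```

Printing the command before passing it to os.system(), as the question already does, is a quick way to confirm that the backslashes survived.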
Although, as the comment on your question points out, you could do it all from Python, with something like:
import os
import sys
def main():
    # variables
    img_path= str(sys.argv[1])
    file_path = str(sys.argv[2])
    # write desc.txt inside file_path, as the original command does
    with open(os.path.join(file_path, 'desc.txt'),'w') as f:
        # endswith() does not understand wildcards, so match on '.png'
        f.writelines(['{}\n'.format(line) for line in os.listdir(img_path) if line.endswith('.png')])
if __name__ == "__main__":
    main()
I fully agree with Kyle. My recommendation is to use only Python code rather than calling bash commands from your code. Here is my recommended code; it is longer and less optimal than the one above, but IMHO it is easier to understand.
#!/usr/bin/python
import glob
import sys
import os
# Taking arguments
img_path = str(sys.argv[1])
file_path = str(sys.argv[2])
# lets put the target filename in a variable (it is better than hardcoding it)
file_name = 'desc.txt'
# better if you make sure that the target folder exists
if not os.path.exists(file_path):
    # if it does not exist, you create it
    os.makedirs(file_path)
# Create the target file (write mode).
outfile = open(os.path.join(file_path, file_name), 'w')
# loop over folder contents
for fname in glob.iglob("%s/*" % img_path):
    # we want to avoid writing 'folders' in the target file
    if os.path.isfile(fname):
        # os.path.basename() takes only the name, whatever separator the operating system uses
        file_name_in_imgPath = os.path.basename(fname)
        # write filename in the target file
        outfile.write(str(file_name_in_imgPath) + '\n')
outfile.close()
As a newcomer to Python I figured I'd write a little python3 script to help me switch directories on the command line (Ubuntu Trusty). Unfortunately os.chdir() does not seem to work.
I've tried tinkering with it in various ways such as placing quotes around the path, removing the leading slash (which obviously doesn't work) and even just hardcoding it, but I can't get it to work - can anybody tell me what I'm missing here?
The call to chdir() happens towards the end - you can see the code in github too
#!/usr/bin/env python3
# @python3
# @author sabot <sabot@inuits.eu>
"""Switch directories without wearing out your slash key"""
import sys
import os
import json
import click
__VERSION__ = '0.0.1'
# 3 params are needed for click callback
def show_version(ctx, param, value):
    """Print version information and exit."""
    if not value:
        return
    click.echo('Goto %s' % __VERSION__)
    ctx.exit() # quit the program
def add_entry(dictionary, filepath, path, alias):
    """Add a new path alias."""
    print("Adding alias {} for path {} ".format(alias,path))
    dictionary[alias] = path
    try:
        jsondata = json.dumps(dictionary, sort_keys=True)
        fd = open(filepath, 'w')
        fd.write(jsondata)
        fd.close()
    except Exception as e:
        print('Error writing to dictionary file: ', str(e))
        pass
def get_entries(filename):
    """Get the alias entries in json."""
    returndata = {}
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        try:
            fd = open(filename, 'r')
            entries = fd.read()
            fd.close()
            returndata = json.loads(entries)
        except Exception as e:
            print('Error reading dictionary file: ', str(e))
            pass
    else:
        print('Dictionary file not found or empty- spawning new one in', filename)
        newfile = open(filename,'w')
        newfile.write('')
        newfile.close()
    return returndata
@click.command()
@click.option('--version', '-v', is_flag=True, is_eager=True,
              help='Print version information and exit.', expose_value=False,
              callback=show_version)
@click.option('--add', '-a', help="Add a new path alias")
@click.option('--target', '-t', help="Alias target path instead of the current directory")
@click.argument('alias', default='currentdir')
@click.pass_context
def goto(ctx, add, alias, target):
    '''Go to any directory in your filesystem'''
    # load dictionary
    filepath = os.path.join(os.getenv('HOME'), '.g2dict')
    dictionary = get_entries(filepath)
    # add a path alias to the dictionary
    if add:
        if target: # don't use current dir as target
            if not os.path.exists(target):
                print('Target path not found!')
                ctx.exit()
            else:
                add_entry(dictionary, filepath, target, add)
        else: # use current dir as target
            current_dir = os.getcwd()
            add_entry(dictionary, filepath, current_dir, add)
    elif alias != 'currentdir':
        if alias in dictionary:
            entry = dictionary[alias]
            print('jumping to',entry)
            os.chdir(entry)
        elif alias == 'hell':
            print("Could not locate C:\Documents and settings")
        else:
            print("Alias not found in dictionary - did you forget to add it?")
if __name__ == '__main__':
    goto()
The problem is not with Python, the problem is that what you're trying to do is impossible.
When you start a Python interpreter (script or interactive REPL), you do so from your "shell" (Bash etc.). The shell has some working directory, and it launches Python in the same one. When Python changes its own working directory, it does not affect the parent shell, nor do changes in the working directory of the shell affect Python after it has started.
If you want to write a program which changes the directory in your shell, you should define a function in your shell itself. That function could invoke Python to determine the directory to change to, e.g. the shell function could be simply cd $(~/myscript.py) if myscript.py prints the directory it wants to switch to.
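The Python side of that arrangement can be quite small. The sketch below is only an assumption about how it might look: resolve_alias and main are hypothetical names, and the ~/.g2dict location simply mirrors the file used in the question. The path goes to stdout and any message goes to stderr, so the shell's $(...) captures nothing but the directory.

```python
import json
import os
import sys


def resolve_alias(alias, dictionary):
    """Return the path stored for alias, or None if it is unknown."""
    return dictionary.get(alias)


def main(argv):
    # Hypothetical location, mirroring the ~/.g2dict file from the question
    dict_path = os.path.join(os.path.expanduser("~"), ".g2dict")
    try:
        with open(dict_path) as fd:
            dictionary = json.load(fd)
    except (IOError, ValueError):
        # Missing, unreadable or empty file: behave as if no aliases exist
        dictionary = {}
    target = resolve_alias(argv[1], dictionary) if len(argv) > 1 else None
    if target is None:
        sys.stderr.write("Alias not found\n")
        return 1
    print(target)  # the only thing written to stdout
    return 0


# Possible shell-side wrapper (bash), assuming the script is ~/myscript.py:
#   g() { cd "$(~/myscript.py "$1")"; }
if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

The cd itself still happens in the shell function, which is the only place where it can affect the interactive session.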
Here's a Python 3 version of @ephemient's C solution:
#!/usr/bin/env python3
"""Change parent working directory."""
#XXX DIRTY HACK, DO NOT DO IT
import os
import sys
from subprocess import Popen, PIPE, DEVNULL, STDOUT
gdb_cmd = 'call chdir("{dir}")\ndetach\nquit\n'.format(dir=sys.argv[1])
with Popen(["gdb", "-p", str(os.getppid()), '-q'],
           stdin=PIPE, stdout=DEVNULL, stderr=STDOUT) as p:
    p.communicate(os.fsencode(gdb_cmd))
sys.exit(p.wait())
Example:
# python3 cd.py /usr/lib && python3 -c 'import os; print(os.getcwd())'
I have written the code below to generate a hash for every mp3 file in a directory, but the system throws an error for files that have a space in their name.
directory - d:\song
Files in the directory AB CD.mp3, Abc.mp3, GB.mp3
import os
dirname = 'd:\song'
def walk(dirname):
    names = []
    for name in os.listdir(dirname):
        path = os.path.join(dirname,name)
        if os.path.isfile(path):
            names.append(path)
        else:
            names.extend(walk(path))
    return names
def chk_dup(f):
    for i in f:
        cmd = 'fciv -md5 %s' % i.replace(' ','')
        fp = os.popen(cmd)
        res = fp.read()
        print(res)
        fp.close()
chk_dup(walk(dirname))
Output is
//
// File Checksum Integrity Verifier version 2.05.
//
d:\song\abcd.mp3\*
Error msg : The system cannot find the path specified.
Error code : 3
//
// File Checksum Integrity Verifier version 2.05.
//
1a65b4c63d64f0634c1411d37629be3b d:\song\abc.mp3
//
// File Checksum Integrity Verifier version 2.05.
//
bbf47eb1cb3625eea648f0b6e0784fd3 d:\song\gb.mp3
You can probably fix your immediate problem by enclosing all the file path arguments in double quotes in case they contain spaces. Each path will then be treated as a single argument rather than as two (or more), which is what happens otherwise.
for i in f:
    cmd = 'fciv -md5 "%s"' % i
    ...
However, rather than just doing that, I would suggest that you stop using os.popen() altogether, because it has been deprecated since Python version 2.6, and use the recommended subprocess module instead. Among other advantages, doing so will automatically handle the quoting of arguments with spaces in them for you.
In addition, it would also be useful to take advantage of the built-in os.walk() function to simplify your own walk() function.
Incorporating both of these changes would result in code looking something like the following:
import os
import subprocess
directory = r'd:\song'
def walk(dirname):
    for root, dirs, files in os.walk(dirname):
        for name in files:
            path = os.path.join(root, name)
            yield path
def chk_dup(files):
    for file in files:
        args = ['fciv', '-md5', file] # cmd as sequence of arguments
        p = subprocess.Popen(args, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
        res = p.communicate()[0] # communicate returns (stdoutdata, stderrdata)
        print(res)
chk_dup(walk(directory))
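If the goal is only to compute MD5 hashes and spot duplicates, the external fciv tool may not be needed at all, which sidesteps the quoting problem entirely. The following is a minimal sketch using the standard hashlib module. It swaps in a different technique from the one in the answer above, and md5_of_file and find_duplicates are hypothetical helper names.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Group files under directory by digest; keep groups with more than one file."""
    by_hash = {}
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            by_hash.setdefault(md5_of_file(path), []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

File names with spaces need no special handling here, because the path is passed to open() as a single string and no command line is built.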
Your file is "AB CD.mp3", not "ABCD.mp3". Because your code strips the space, it looks for "ABCD.mp3", which cannot be found.
Try wrapping the file name in ' when you build the command:
cmd = "fciv -md5 '%s'" % i