Batch (basename) file/folder renaming using an "index" - python

Batch renaming of files and folders is a frequently asked question, but after some searching I could not find one similar to mine.
Background: we send some biological samples to a service provider which returns files with unique names and a table in text format containing, amongst other information, the file name and the sample that originated it:
head samples.txt
fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz S1726_A_3_t A 3 t L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz S1731_A_GFP_c A GFP c L2354_A_GFP_c 163 5
L2377_Track-3893_R1.fastq.gz S1754_B_7_c B 7 c L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz S1739_B_GFP_t B GFP t L2362_B_GFP_t 163 6
The directory structure (for 34 directories):
L2369_Track-3885_
accepted_hits.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
L2349_Track-3865_
accepted_hits.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
Goal: because the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and the folders with the corresponding sample name, re-ordered in a more suitable manner. The result should look like:
7_t_B
7_t_B.bam
deletions.bed
junctions.bed
logs
7_t_B.bam.bai
insertions.bed
left_kept_reads.info
3_t_A
3_t_A.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
I've hacked together a solution with bash and Python (I'm a newbie), but it feels over-engineered. Is there a simpler or more elegant way of doing it that I've missed? Solutions can be in Python, bash, or R; awk is also welcome, since I am trying to learn it. Being a relative beginner does make one complicate things.
This is my solution:
A wrapper puts it all in place and gives an idea of the workflow:
#! /bin/bash
# select columns of interest and write them to a file - basenames
tail -n +2 samples.txt | cut -d$'\t' -f1,3 >> BAMfilames.txt
# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py
# finally do the renaming
./renameBam.sh
# and the folders too
./renameBamFolder.sh
renameBamFiles.py:
#! /usr/bin/env python
import re
# Read in the data sample file and create a bash file that will rename the tophat output
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
#
# Set the input file name
# (The program must be run from within the directory
# that contains this data file)
InFileName = 'BAMfilames.txt'
### Rename BAM files
# Open the input file for reading
InFile = open(InFileName, 'r')
# Open the output file for writing
OutFileName = 'renameBam.sh'
OutFile = open(OutFileName, 'w')  # You can append instead with 'a'
OutFile.write("#! /bin/bash" + "\n")
OutFile.write(" " + "\n")
# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line = Line.strip('\n')
    ## Separate the line into a list of its tab-delimited components
    ElementList = Line.split('\t')
    # separate the folder string from the experimental name
    fileroot = ElementList[1]
    fileroot = fileroot.split()
    # create variable names using regex
    folderName = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName = folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command = "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)
    print(command)
    OutFile.write(command + "\n")
# After the loop is completed, close the files
InFile.close()
OutFile.close()
### Rename folders
# Open the input file for reading
InFile = open(InFileName, 'r')
# Open the output file for writing
OutFileName = 'renameBamFolder.sh'
OutFile = open(OutFileName, 'w')
OutFile.write("#! /bin/bash" + "\n")
OutFile.write(" " + "\n")
# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line = Line.strip('\n')
    ## Separate the line into a list of its tab-delimited components
    ElementList = Line.split('\t')
    # separate the folder string from the experimental name
    fileroot = ElementList[1]
    fileroot = fileroot.split()
    # create variable names using regex
    folderName = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName = folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command = "mv %s %s" % (folderName, fileName)
    print(command)
    OutFile.write(command + "\n")
# After the loop is completed, close the files
InFile.close()
OutFile.close()
renameBam.sh - created by the previous python script:
#! /bin/bash
for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)
renameBamFolder.sh is very similar:
mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B
Since I am learning, I feel that some examples of different ways of doing this, and thinking about how to do it, will be very useful.

One simple way in bash:
find . -type d -print |
while IFS= read -r oldPath; do
    parent=$(dirname "$oldPath")
    old=$(basename "$oldPath")
    new=$(awk -v old="$old" '$1~"^"old{print $4"_"$5"_"$3}' samples.txt)
    if [ -n "$new" ]; then
        newPath="${parent}/${new}"
        echo mv "$oldPath" "$newPath"
        echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
    fi
done
Remove the "echo"s after initial testing to get it to actually do the "mv"s.
If all of your target directories are at one level as @triplee's answer implies, then it's even simpler. Just cd to their parent directory and do:
awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
    echo mv "$old" "$new"
    echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done
In one of your expected outputs you renamed the ".bai" file, in the other you didn't and you didn't say if you want to do that or not. If you want to rename it too just add
echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"
to whatever solution above you prefer.

Of course you can do it in Python alone, and it yields a small, readable script.
First, read the samples.txt file and create a map from the existing file prefixes to the desired prefixes. The file is not formatted for the Python csv reader module, as the column separator is also used inside the last data column.
mapping = {}
with open("samples.txt") as samples:
    # throw away headers
    samples.readline()
    for line in samples:
        # separate the columns by splitting on whitespace runs
        # (either space sequences or tabs)
        fields = line.split()
        # skip blank or malformed lines:
        if len(fields) < 6:
            continue
        fq_file, sample_id, Sample_name, Library_ID, FC_Number, track_lanes_pos, *other = fields
        # the [:-2] part throws away the "R1" suffix, as in the example above
        file_prefix = fq_file.split(".")[0][:-2]
        target_id = "_".join((Library_ID, FC_Number, Sample_name))
        mapping[file_prefix] = target_id
Then check the dir names, and inside each one the ".bam" files for remapping.
import os
for entry in os.listdir("."):
    if entry in mapping:
        dir_prefix = "./" + entry + "/"
        for file_entry in os.listdir(dir_prefix):
            if ".bam" in file_entry:
                parts = file_entry.split(".bam")
                parts[0] = mapping[entry]
                new_name = ".bam".join(parts)
                os.rename(dir_prefix + file_entry, dir_prefix + new_name)
        os.rename(entry, mapping[entry])
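A dry-run variant can make the approach above safer to test. The following is only a sketch under stated assumptions, not the answerer's code: the function names build_mapping and plan_renames are hypothetical, the table is assumed to be whitespace-separated as in the excerpt, and each directory is assumed to be named like the fq_file value with the trailing "R1.fastq.gz" removed. It returns the planned (source, destination) pairs and renames nothing.

```python
import os

def build_mapping(lines):
    """Map directory prefix -> new name from whitespace-separated table lines."""
    mapping = {}
    suffix = "R1.fastq.gz"                  # assumed ending of the fq_file column
    for line in lines[1:]:                  # skip the header row
        fields = line.split()
        if len(fields) < 5:                 # skip blank or malformed lines
            continue
        fq_file, _sample_id, group, target, fraction = fields[:5]
        if not fq_file.endswith(suffix):
            continue
        mapping[fq_file[:-len(suffix)]] = "_".join((target, fraction, group))
    return mapping

def plan_renames(root, mapping):
    """Return (old_path, new_path) pairs: files first, then their directory."""
    plan = []
    for entry in sorted(os.listdir(root)):
        old_dir = os.path.join(root, entry)
        if entry not in mapping or not os.path.isdir(old_dir):
            continue
        new = mapping[entry]
        for name in sorted(os.listdir(old_dir)):
            if name.startswith("accepted_hits.bam"):
                renamed = new + name[len("accepted_hits"):]
                plan.append((os.path.join(old_dir, name),
                             os.path.join(old_dir, renamed)))
        plan.append((old_dir, os.path.join(root, new)))
    return plan
```

Once the printed plan looks right, passing each pair to os.rename in the returned order would perform the renaming, since the files are renamed before their parent directory moves.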

Seems that you could simply read the required fields from the index file in a simple while loop. It's not obvious how the file is structured, so I am assuming that the file is whitespace-separated and that the Sample_Id is actually four fields (complex sample_id, then three components from the name). Maybe you have a tab-delimited file with internal spaces in the Sample_Id field? Anyhow, this should be easy to adapt if my assumptions are wrong.
# Skip the annoying field names
tail -n +2 samples.txt |
while read -r fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done
Take out the echos if the output looks like what you want.

Here's one way using a shell script. Run like:
script.sh /path/to/samples.txt /path/to/data
Contents of script.sh:
# add directory names to an array
while IFS= read -r -d '' dir; do
    dirs+=("$dir")
done < <(find "$2"/* -type d -print0)
# process the sample list
while IFS=$'\t' read -r -a list; do
    for i in "${dirs[@]}"; do
        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then
            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            # only change name if there's a bam file
            if [ -f "$i/accepted_hits.bam" ]; then
                mv "$i" "$new"
                mv "$new/accepted_hits.bam" "$new/$tag.bam"
            fi
        fi
    done
done < <(tail -n +2 "$1")

Although it's not exactly what you're looking for (just thinking outside the box), you might consider an alternate "view" of your file system, using the term "view" the way a database view relates to a table. You could do this via a "file system in user space", FUSE. A number of existing utilities do this, but I don't know of one that works generically with any set of files purely for renaming or re-organizing. As a concrete example of how it can be used, pytagsfs creates a virtual (FUSE) file system based on rules you define, making a directory structure of files appear however you want. (Maybe this would work for you too, although pytagsfs is actually intended for media files.) You then operate on that virtual file system using whatever programs normally access the data. To make the virtual directory structure permanent (if pytagsfs doesn't already have an option for this), just copy the virtual file system into another directory outside the virtual file system.
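A much plainer stand-in for the "view" idea, which is not FUSE and not pytagsfs, is a separate directory of symbolic links that carry the readable names and point at the untouched originals. The sketch below assumes a POSIX-like system where symlinks are permitted; the function name build_symlink_view and the shape of the mapping argument (old directory name to new name) are hypothetical.

```python
import os

def build_symlink_view(source_root, view_root, mapping):
    """Create view_root/<new name> symlinks pointing at source_root/<old name>."""
    os.makedirs(view_root, exist_ok=True)
    created = []
    for old, new in sorted(mapping.items()):
        target = os.path.abspath(os.path.join(source_root, old))
        link = os.path.join(view_root, new)
        # Only link directories that exist, and never replace an existing entry.
        if os.path.isdir(target) and not os.path.lexists(link):
            os.symlink(target, link, target_is_directory=True)
            created.append(link)
    return created
```

The original directories keep their provider-assigned names, so the view can be deleted and rebuilt at any time without risk to the data. Unlike the rule-based virtual file system described above, the files inside each linked directory keep their original names.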

Related

How can I move files to new folders (not created yet) according to a regex pattern?

I'm on a MacBook and would like to use bash to accomplish this task.
Let's say I have a list of files in one directory like:
item1_2352352352.jpg
item2_2352352352.jpg
item3_2352352352.jpg
item3_2352352352.jpg
How can I sort and move these files to new folders with a bash one-liner? If not bash, python is OK too.
I'd like the folders to be named as item1, item2 etc. (everything before the first underscore).
That would do the trick:
for i in item*_*; do folder=${i%_*}; mkdir -p "$folder"; mv -n "$i" "$folder"; done
Explanation:
for i in item*_*; do      # loops over every file matching "item*_*" and puts the current filename in $i
    folder=${i%_*}        # strips the shortest suffix matching "_*" (from the last underscore on) and saves the result in $folder; use ${i%%_*} to cut at the first underscore instead
    mkdir -p "$folder"    # creates the folder named in $folder, without an error if the directory already exists (-p)
    mv -n "$i" "$folder"  # moves the file without overwriting an existing one (-n)
done
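Since the question also allows Python, the same idea can be sketched as below. This is an illustration under assumptions rather than a drop-in tool: the function name sort_into_folders is hypothetical, the folder name is taken as everything before the first underscore, and existing destination files are skipped to mirror mv -n.

```python
import os
import shutil

def sort_into_folders(directory):
    """Move each 'prefix_rest' file into a subfolder named 'prefix'."""
    moved = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or "_" not in name:
            continue                          # ignore folders and unrelated files
        folder = os.path.join(directory, name.split("_", 1)[0])
        os.makedirs(folder, exist_ok=True)    # like mkdir -p
        destination = os.path.join(folder, name)
        if os.path.exists(destination):       # like mv -n: never overwrite
            continue
        shutil.move(path, destination)
        moved.append(destination)
    return moved
```

Calling sort_into_folders(".") from the directory holding the images returns the list of new paths, which is handy for a quick check of what was moved.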

how to split a full file path into a path and a file name without an extension

how to split a full file path into a path and a file name without an extension?
I'm looking for any files with the extension .conf:
find /path -name "*.conf"
/path/file1.conf
/path/smth/file2.conf
/path/smth/file3.conf
/path/smth/smth1/.conf
...
/path/smt//*.conf
I need the output in string(without extension .conf):
/path;file1|path/smth;file2;file3|...
What's the best way to do it?
I was thinking of a solution - save the output of the find work to a file and process them in a loop..but maybe there is a more effective way.
Sorry for any mistakes, I'm a newbie.
Thanks for your feedback, guys!
since you mentioned .conf, does this help?
kent$ basename -s .conf '/path/smth/file2.conf'
file2
kent$ dirname '/path/smth/file2.conf'
/path/smth
To do this in Bash:
find /path/ -type f -name "*.conf"
Note that if you want to do this in a Bash script, you can store /path/ in a variable, for instance one named directory, and change the command like so:
find "$directory" -type f -name "*.conf"
To do this in Python:
import os
PATH = '/path/'
test_files = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames
              if os.path.splitext(f)[1] == '.conf']
There are some other ways to do this in Python listed here as well
Bash parameter expansion is easy, fast, and lightweight.
for fp in /path/file1.conf /path/smth/file2.conf /path/smth/file3.conf; do
    p="${fp%/*}"  # % strips the pattern from the end (minimal, non-greedy)
    f="${fp##*/}" # ## strips the pattern from the beginning (max-match, greedy)
    f="${f%.*}"   # end-strip the already path-cleaned filename to remove the extension
    echo "$p, $f"
done
/path, file1
/path/smth, file2
/path/smth, file3
To get what you apparently want as your formatting -
declare -A paths # associative array
while read -r fp; do
    p=${fp%/*} f=${fp##*/};           # preparse path and filename
    paths[$p]="${paths[$p]};${f%.*}"; # p as key, stacked/delimited val
done < file
Then stack/delimit your datasets.
for p in "${!paths[@]}"; do printf "%s|" "$p${paths[$p]}"; done; echo
/path;file1|/path/smth;file2;file3|
For each key, print key/val and a delimiter. echo at end for a newline.
If you don't want the trailing pipe, assign it all to one var in the second loop instead of printing it out, and trim the trailing pipe at the end.
$: for p in "${!paths[@]}"; do out="$out$p${paths[$p]}|"; done; echo "${out%|}"
/path;file1|/path/smth;file2;file3
Some folk will tell you not to use bash for anything this complex. Be aware that it can lead to ugly maintenance, especially if the people maintaining it behind you aren't bash experts and can't be bothered to go RTFM.
If you actually needed that embedded space in your example then your rules are inconsistent and you'll have to explain them.
If you have the file paths in a list, you can do this using a dictionary with the path as key and the filenames as value:
aa=['/path/file1.conf','/path/smth/file2.conf','/path/smth/file3.conf']
f={}
for x in aa:
    temp=x[:-len(".conf")].split("/")
    filename=temp[-1]
    path="/".join(temp[:-1])
    if path in f:
        f[path]=f[path]+";"+filename
    else:
        f[path]=filename
result=""
for x in f:
    result=result+str(x)+";"+f[x]+"|"
print(result)
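The find step and the grouping step can also be combined in one pass. The sketch below is only an illustration: the function name group_conf_files is hypothetical, it uses os.walk in place of find, and it assumes the desired format is "directory;name;name" groups joined by "|" with no trailing pipe.

```python
import os

def group_conf_files(root, extension=".conf"):
    """Return 'dir;name1;name2|dir2;name3' for matching files under root."""
    groups = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()   # make the traversal order deterministic
        stems = sorted(os.path.splitext(f)[0] for f in filenames
                       if os.path.splitext(f)[1] == extension)
        if stems:         # skip directories without matching files
            groups.append(";".join([dirpath] + stems))
    return "|".join(groups)
```

For example, print(group_conf_files("/path")) would print one line in the requested shape, with directories that contain no .conf files left out.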

Extracting inner file from zip file with python

I am able to extract the inner file, but it extracts the entire chain.
Suppose the following file structure
v a.zip
v folder1
v folder2
> inner.txt
and suppose I want to extract inner.txt to some folder target.
Currently what happens when I try to do this is that I end up extracting folder1/folder2/inner.txt to target. Is it possible to extract the single file instead of the entire chain of directories? So that when target is opened, the only thing inside is inner.txt.
EDIT:
Using python zip module to unzip files and extract only the inner files to the desired location.
You should use the -j (junk paths (do not make directories)) modifier (old v5.52 has it). Here's the full list: [DIE.Linux]: unzip(1) - Linux man page, or you could simply run (${PATH_TO}/)unzip in the terminal, and it will output the argument list.
Considering that you want to extract the file in a folder called target, use the command (you may need to specify the path to unzip):
"unzip" -j "a.zip" -d "target" "folder1/folder2/inner.txt"
Output (Win, but for Nix it's the same thing):
(py35x64_test) c:\Work\Dev\StackOverflow\q047439536>"unzip" -j "a.zip" -d "target" "folder1/folder2/inner.txt"
Archive: a.zip
inflating: target/inner.txt
Output (without -j):
(py35x64_test) c:\Work\Dev\StackOverflow\q047439536>"unzip" "a.zip" -d "target" "folder1/folder2/inner.txt"
Archive: a.zip
inflating: target/folder1/folder2/inner.txt
Or, since you mentioned Python,
code00.py:
import os
from zipfile import ZipFile

def extract_without_folder(arc_name, full_item_name, folder):
    with ZipFile(arc_name) as zf:
        file_data = zf.read(full_item_name)
    with open(os.path.join(folder, os.path.basename(full_item_name)), "wb") as fout:
        fout.write(file_data)

if __name__ == "__main__":
    extract_without_folder("a.zip", "folder1/folder2/inner.txt", "target")
The zip doesn't have a folder structure in the same way as on the filesystem - each file has a name that is its entire path.
You'll want to use a method that allows you to read the file contents (such as ZipFile.open or ZipFile.read), extract the part of the filename you actually want to use, and save the file contents to that file yourself.
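The read-then-save approach described above can be sketched with ZipFile.open and shutil.copyfileobj, which streams the member in chunks instead of loading it into memory at once. The function name extract_flat is hypothetical, and the sketch assumes the member name is known in advance.

```python
import os
import shutil
from zipfile import ZipFile

def extract_flat(archive_path, member_name, target_dir):
    """Extract one archive member into target_dir, dropping its inner folders."""
    os.makedirs(target_dir, exist_ok=True)
    # Zip member names use "/" as the separator regardless of the OS.
    destination = os.path.join(target_dir, member_name.rsplit("/", 1)[-1])
    with ZipFile(archive_path) as archive:
        with archive.open(member_name) as source, open(destination, "wb") as sink:
            shutil.copyfileobj(source, sink)   # copy in chunks
    return destination
```

With the example archive, extract_flat("a.zip", "folder1/folder2/inner.txt", "target") would leave only inner.txt inside target.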

How to empty(delete all contents) all files in a directory without deleting the files?

Let's say I have 1000 txt files and I have to empty all the contents of each file, keeping just 1000 files with no content in them.
I was trying to use cat /dev/null > *.txt in shell but I was getting -bash: *.txt: ambiguous redirect and the files were not emptied. Any suggestions, please?
In Python you would write
import glob
for filename in glob.glob('*.txt'):
    with open(filename, 'w') as f:
        f.write("")
You could use a bash for loop.
for file in *.txt; do : > "$file"; done
Now an explanation of the different parts:
for file in *.txt
This creates a for loop over all the .txt files in the directory, assigning each filename in turn to the variable $file. A glob is used rather than the output of ls, which breaks on filenames containing whitespace.
do : > "$file"
This redirects the (empty) output of the no-op command : into each file, truncating it to zero bytes. Note that echo "" > "$file" would leave a single newline in each file instead of emptying it.
done
This ends the for loop.
Beware though, this is a destructive command and will clear every .txt file in the directory!
In the shell you can empty a file by redirecting an empty string into it (printf is used because echo '' would write a newline):
printf '' > yourfile
One way to truncate the file:
> file

Bulk renaming of files based on lookup

I have a folder full of image files such as
1500000704_full.jpg
1500000705_full.jpg
1500000711_full.jpg
1500000712_full.jpg
1500000714_full.jpg
1500000744_full.jpg
1500000745_full.jpg
1500000802_full.jpg
1500000803_full.jpg
I need to rename the files based on a lookup from a text file which has entries such as,
SH103239 1500000704
SH103240 1500000705
SH103241 1500000711
SH103242 1500000712
SH103243 1500000714
SH103244 1500000744
SH103245 1500000745
SH103252 1500000802
SH103253 1500000803
SH103254 1500000804
So, I want the image files to be renamed,
SH103239_full.jpg
SH103240_full.jpg
SH103241_full.jpg
SH103242_full.jpg
SH103243_full.jpg
SH103244_full.jpg
SH103245_full.jpg
SH103252_full.jpg
SH103253_full.jpg
SH103254_full.jpg
What is the easiest way to do this? Could anyone write me a quick command or script that does it? I have a lot of these image files, and changing them manually isn't feasible.
I am on Ubuntu, but depending on the tool I can switch to Windows if need be. Ideally I would love to have it as a bash script so that I can learn more, or in simple Perl or Python.
Thanks
EDIT: Had to Change the file names
Here's a simple Python 2 script to do the rename.
#!/usr/bin/env python
import os
# A dict with keys being the old filenames and values being the new filenames
mapping = {}
# Read through the mapping file line-by-line and populate 'mapping'
with open('mapping.txt') as mapping_file:
    for line in mapping_file:
        # Split the line along whitespace
        # Note: this fails if your filenames have whitespace
        new_name, old_name = line.split()
        mapping[old_name] = new_name
suffix = '_full'
# List the files in the current directory
for filename in os.listdir('.'):
    root, extension = os.path.splitext(filename)
    if not root.endswith(suffix):
        # File doesn't end with this suffix; ignore it
        continue
    # Strip off the number of characters that make up suffix
    stripped_root = root[:-len(suffix)]
    if stripped_root in mapping:
        os.rename(filename, mapping[stripped_root] + suffix + extension)
Various bits of the script are hard-coded that really shouldn't be. These include the name of the mapping file (mapping.txt) and the filename suffix (_full). These could presumably be passed in as arguments and interpreted using sys.argv.
This will work for your problem:
#!/usr/bin/perl
while (<DATA>) {
    my($new, $old) = split;
    rename("$old.jpg", "$new.jpg")
        || die "can't rename \"$old.jpg\", \"$new.jpg\": $!";
}
__END__
SH103239 1500000704
SH103240 1500000705
SH103241 1500000711
SH103242 1500000712
SH103243 1500000714
SH103244 1500000744
SH103245 1500000745
SH103252 1500000802
SH103253 1500000803
SH103254 1500000804
Switch to ARGV from DATA to read the lines from a particular input file.
Normally for mass rename operations, I use something more like this:
#!/usr/bin/perl
# rename script by Larry Wall
#
# eg:
# rename 's/\.orig$//' *.orig
# rename 'y/A-Z/a-z/ unless /^Make/' *
# rename '$_ .= ".bad"' *.f
# rename 'print "$_: "; s/foo/bar/ if <STDIN> =~ /^y/i' *
# find /tmp -name '*~' -print | rename 's/^(.+)~$/.#$1/'
($op = shift) || die "Usage: rename expr [files]\n";
chomp(@ARGV = <STDIN>) unless @ARGV;
for (@ARGV) {
    $was = $_;
    eval $op;
    die if $@; # means eval `failed'
    rename($was,$_) unless $was eq $_;
}
I’ve a more full-featured version, but that should suffice.
#!/bin/bash
for FILE in *.jpg; do
    OLD=${FILE%.*} # Strip off extension.
    NEW=$(awk -v "OLD=$OLD" '$2==OLD {print $1}' map.txt)
    # Skip files that have no entry in the map.
    [ -n "$NEW" ] && mv "$OLD.jpg" "$NEW.jpg"
done
A rewrite of Wesley's using generators:
import os, os.path
with open('mapping.txt') as mapping_file:
    # each line is "new_name old_name"; reverse it to map old -> new
    mapping = dict(reversed(line.split()) for line in mapping_file if line.strip())
# yield (filename, root, extension) triples
rootextiter = ((filename,) + os.path.splitext(filename) for filename in os.listdir('.'))
mappediter = (
    (filename, mapping[root] + extension)
    for filename, root, extension in rootextiter
    if root in mapping
)
for oldname, newname in mappediter:
    os.rename(oldname, newname)
This is very straightforward to do in Bash assuming that there's an entry in the lookup file for each file and each file has a lookup entry.
#!/bin/bash
while read -r to from
do
    if [ -e "${from}_full.jpg" ]
    then
        mv "${from}_full.jpg" "${to}_full.jpg"
    fi
done < lookupfile.txt
If the lookup file has many more entries than there are files then this approach may be inefficient. If the reverse is true then an approach that iterates over the files may be inefficient. However, if the numbers are close then this may be the best approach since it doesn't have to actually do any lookups.
If you'd prefer a lookup version that's pure-Bash:
#!/bin/bash
while read -r to from
do
    lookup[from]=$to
done < lookupfile.txt
for file in *.jpg
do
    base=${file%*_full.jpg}
    mv "$file" "${lookup[base]}_full.jpg"
done
I modified Wesley's Code to work for my specific situation. I had a mapping file "sort.txt" that consisted of different .pdf files and numbers to indicate the order that I want them in based on an output from DOM manipulation from a website. I wanted to combine all these separate pdf files into a single pdf file but I wanted to retain the same order they are in as they are on the website. So I wanted to append numbers according to their tree location in a navigation menu.
1054 spellchecking.pdf
1055 using-macros-in-the-editor.pdf
1056 binding-macros-with-keyboard-shortcuts.pdf
1057 editing-macros.pdf
1058 etc........
Here is the Code I came up with:
import os, sys
# A dict with keys being the old filenames and values being the new filenames
mapping = {}
# Read through the mapping file line-by-line and populate 'mapping'
with open('sort.txt') as mapping_file:
    for line in mapping_file:
        # Split the line along whitespace
        # Note: this fails if your filenames have whitespace
        new_name, old_name = line.split()
        mapping[old_name] = new_name
# List the files in the current directory
for filename in os.listdir('.'):
    root, extension = os.path.splitext(filename)
    # rename, putting the number first to allow for sorting by name,
    # then append the original filename root + extension
    if filename in mapping:
        print("yay")  # to make coding fun
        os.rename(filename, mapping[filename] + root + extension)
I didn't have a suffix like _full so I didn't need that code. Other than that it's the same code. I've never really touched Python, so this was a good learning experience for me.
Read in the text file, create a hash with the current file name, so files['1500000704'] = 'SH103239' and so on. Then go through the files in the current directory, grab the new filename from the hash, and rename it.
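The hash-then-rename outline above can be sketched as a pure planning function, so the result can be inspected before anything is renamed. The function name plan_lookup_renames and its parameters are hypothetical; the sketch assumes lookup lines of the form "NEW OLD" and filenames of the form OLD_full.jpg, as in the question.

```python
def plan_lookup_renames(filenames, lookup_lines, suffix="_full"):
    """Return (old, new) filename pairs; each lookup line holds 'NEW OLD'."""
    new_for = {}
    for line in lookup_lines:
        parts = line.split()
        if len(parts) == 2:                   # ignore blank or malformed lines
            new, old = parts
            new_for[old] = new
    plan = []
    for filename in filenames:
        root, dot, extension = filename.rpartition(".")
        if not dot or not root.endswith(suffix):
            continue                          # not one of the target files
        key = root[:len(root) - len(suffix)]  # drop the suffix, e.g. "_full"
        if key in new_for:
            plan.append((filename, new_for[key] + suffix + "." + extension))
    return plan
```

Feeding it os.listdir(".") and the lines of the lookup file, then calling os.rename on each returned pair, would carry out the renaming. Files without a lookup entry are simply left alone.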
Here's a fun little hack:
paste -d " " lookupfile.txt lookupfile.txt | cut -d " " -f 2,3 | sed "s/\([ ]\|$\)/_full.jpg /g;s/^/mv /" | sh
import os,re,sys
mapping = {}  # insert your mapping here: dictionary of old -> new entries (lookup)
for k,v in mapping.items():
    for f in os.listdir("."):
        if re.match('1500',f): # executes code on specific files only
            os.rename(f,f.replace(k,v))
