Bulk renaming of files based on lookup - python

I have a folder full of image files such as
1500000704_full.jpg
1500000705_full.jpg
1500000711_full.jpg
1500000712_full.jpg
1500000714_full.jpg
1500000744_full.jpg
1500000745_full.jpg
1500000802_full.jpg
1500000803_full.jpg
I need to rename the files based on a lookup from a text file which has entries such as,
SH103239 1500000704
SH103240 1500000705
SH103241 1500000711
SH103242 1500000712
SH103243 1500000714
SH103244 1500000744
SH103245 1500000745
SH103252 1500000802
SH103253 1500000803
SH103254 1500000804
So, I want the image files to be renamed,
SH103239_full.jpg
SH103240_full.jpg
SH103241_full.jpg
SH103242_full.jpg
SH103243_full.jpg
SH103244_full.jpg
SH103245_full.jpg
SH103252_full.jpg
SH103253_full.jpg
SH103254_full.jpg
How can I do this job the easiest? Can anyone write me a quick command or script which can do this for me, please? I have a lot of these image files and manual change isn't feasible.
I am on Ubuntu, but depending on the tool I can switch to Windows if need be. Ideally I would love to have it as a bash script so that I can learn more, or in simple Perl or Python.
Thanks
EDIT: Had to Change the file names

Here's a simple Python 2 script to do the rename.
#!/usr/bin/env python
import os

# A dict with keys being the old filenames and values being the new filenames
mapping = {}

# Read through the mapping file line-by-line and populate 'mapping'
with open('mapping.txt') as mapping_file:
    for line in mapping_file:
        # Split the line along whitespace
        # Note: this fails if your filenames have whitespace
        new_name, old_name = line.split()
        mapping[old_name] = new_name

suffix = '_full'

# List the files in the current directory
for filename in os.listdir('.'):
    root, extension = os.path.splitext(filename)
    if not root.endswith(suffix):
        # File doesn't end with this suffix; ignore it
        continue
    # Strip off the number of characters that make up suffix
    stripped_root = root[:-len(suffix)]
    if stripped_root in mapping:
        os.rename(filename, mapping[stripped_root] + suffix + extension)
Various bits of the script are hard-coded that really shouldn't be. These include the name of the mapping file (mapping.txt) and the filename suffix (_full). These could presumably be passed in as arguments and interpreted using sys.argv.
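For instance, the hard-coded values could be read from the command line. This is just a sketch of that idea; the argument order (mapping file first, then suffix) is my own assumption, not something from the original script:

```python
import os
import sys

def load_mapping(lines):
    """Build {old_name: new_name} from 'NEW OLD' lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            new_name, old_name = parts
            mapping[old_name] = new_name
    return mapping

def rename_all(mapping, suffix, directory='.'):
    for filename in os.listdir(directory):
        root, extension = os.path.splitext(filename)
        if root.endswith(suffix) and root[:-len(suffix)] in mapping:
            new = mapping[root[:-len(suffix)]] + suffix + extension
            os.rename(os.path.join(directory, filename),
                      os.path.join(directory, new))

if __name__ == '__main__' and len(sys.argv) == 3:
    # Usage: rename.py mapping.txt _full
    mapping_path, suffix = sys.argv[1], sys.argv[2]
    with open(mapping_path) as f:
        rename_all(load_mapping(f), suffix)
```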

This will work for your problem:
#!/usr/bin/perl
while (<DATA>) {
    my($new, $old) = split;
    rename("${old}_full.jpg", "${new}_full.jpg")
        || die "can't rename ${old}_full.jpg to ${new}_full.jpg: $!";
}
__END__
SH103239 1500000704
SH103240 1500000705
SH103241 1500000711
SH103242 1500000712
SH103243 1500000714
SH103244 1500000744
SH103245 1500000745
SH103252 1500000802
SH103253 1500000803
SH103254 1500000804
Switch to ARGV from DATA to read the lines from a particular input file.
Normally for mass rename operations, I use something more like this:
#!/usr/bin/perl
# rename script by Larry Wall
#
# eg:
#    rename 's/\.orig$//' *.orig
#    rename 'y/A-Z/a-z/ unless /^Make/' *
#    rename '$_ .= ".bad"' *.f
#    rename 'print "$_: "; s/foo/bar/ if <STDIN> =~ /^y/i' *
#    find /tmp -name '*~' -print | rename 's/^(.+)~$/.#$1/'
($op = shift) || die "Usage: rename expr [files]\n";
chomp(@ARGV = <STDIN>) unless @ARGV;
for (@ARGV) {
    $was = $_;
    eval $op;
    die if $@; # means eval `failed'
    rename($was,$_) unless $was eq $_;
}
I’ve a more full-featured version, but that should suffice.

#!/bin/bash
for FILE in *.jpg; do
    OLD=${FILE%_full.jpg} # Strip off the suffix to get the lookup key.
    NEW=$(awk -v "OLD=$OLD" '$2==OLD {print $1}' map.txt)
    [ -n "$NEW" ] && mv "$FILE" "${NEW}_full.jpg"
done

A rewrite of Wesley's using generators:
import os

suffix = '_full'
with open('mapping.txt') as mapping_file:
    # note: each line splits as (new, old), so reverse the pair for the dict
    mapping = dict(line.split()[::-1] for line in mapping_file)
rootextiter = ((filename,) + os.path.splitext(filename) for filename in os.listdir('.'))
mappediter = (
    (filename, mapping[root[:-len(suffix)]] + suffix + extension)
    for filename, root, extension in rootextiter
    if root.endswith(suffix) and root[:-len(suffix)] in mapping
)
for oldname, newname in mappediter:
    os.rename(oldname, newname)

This is very straightforward to do in Bash, assuming that every file has an entry in the lookup file and vice versa.
#!/bin/bash
while read -r to from
do
    if [ -e "${from}_full.jpg" ]
    then
        mv "${from}_full.jpg" "${to}_full.jpg"
    fi
done < lookupfile.txt
If the lookup file has many more entries than there are files then this approach may be inefficient. If the reverse is true then an approach that iterates over the files may be inefficient. However, if the numbers are close then this may be the best approach since it doesn't have to actually do any lookups.
If you'd prefer a lookup version that's pure Bash:
#!/bin/bash
declare -A lookup
while read -r to from
do
    lookup[$from]=$to
done < lookupfile.txt

for file in *.jpg
do
    base=${file%_full.jpg}
    mv "$file" "${lookup[$base]}_full.jpg"
done

I modified Wesley's code to work for my specific situation. I had a mapping file "sort.txt" that consisted of different .pdf files and numbers indicating the order I want them in, based on output from DOM manipulation of a website. I wanted to combine all these separate pdf files into a single pdf file but retain the same order they are in on the website, so I wanted to prepend numbers according to their tree location in the navigation menu.
1054 spellchecking.pdf
1055 using-macros-in-the-editor.pdf
1056 binding-macros-with-keyboard-shortcuts.pdf
1057 editing-macros.pdf
1058 etc........
Here is the Code I came up with:
import os

# A dict with keys being the old filenames and values being the new filenames
mapping = {}

# Read through the mapping file line-by-line and populate 'mapping'
with open('sort.txt') as mapping_file:
    for line in mapping_file:
        # Split the line along whitespace
        # Note: this fails if your filenames have whitespace
        new_name, old_name = line.split()
        mapping[old_name] = new_name

# List the files in the current directory
for filename in os.listdir('.'):
    # Rename: put the number first to allow sorting by name,
    # then append the original filename (which already has its extension)
    if filename in mapping:
        print "yay" # to make coding fun
        os.rename(filename, mapping[filename] + filename)
I didn't have a suffix like _full so I didn't need that code. Other than that it's the same code. I've never really touched Python, so this was a good learning experience for me.

Read in the text file, create a hash with the current file name, so files['1500000704'] = 'SH103239' and so on. Then go through the files in the current directory, grab the new filename from the hash, and rename it.
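A minimal sketch of that approach (the `_full` suffix handling and the lookup-file format are taken from the question):

```python
import os

def build_lookup(lines):
    # the hash described above: files['1500000704'] = 'SH103239'
    return {old: new for new, old in (line.split() for line in lines)}

def rename_images(files, directory='.'):
    for name in os.listdir(directory):
        base, ext = os.path.splitext(name)   # e.g. '1500000704_full', '.jpg'
        key = base[:-len('_full')]           # the numeric part
        if base.endswith('_full') and key in files:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, files[key] + '_full' + ext))
```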

Here's a fun little hack:
paste -d " " lookupfile.txt lookupfile.txt | cut -d " " -f 2,3 | sed "s/\([ ]\|$\)/_full.jpg /g;s/^/mv /" | sh

import os
import re

mapping = {} # <Insert your mapping here> - dictionary of {old: new} entries (lookup)
for k, v in mapping.items():
    for f in os.listdir("."):
        if re.match('1500', f): # Executes code on specific files
            os.rename(f, f.replace(k, v))

how to split a full file path into a path and a file name without an extension?
I'm looking for any files with the extension .conf:
find /path -name "*.conf"
/path/file1.conf
/path/smth/file2.conf
/path/smth/file3.conf
/path/smth/smth1/.conf
...
/path/smt//*.conf
I need the output in string(without extension .conf):
/path;file1|path/smth;file2;file3|...
What's the best way to do it?
I was thinking of a solution - save the output of the find work to a file and process them in a loop..but maybe there is a more effective way.
Sorry for any mistakes, I'm a newbie..
Thanks for your feedback, guys!
since you mentioned .conf, does this help?
kent$ basename -s .conf '/path/smth/file2.conf'
file2
kent$ dirname '/path/smth/file2.conf'
/path/smth
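If you end up scripting this, the same split is available in Python's os.path; a small sketch:

```python
import os.path

path = '/path/smth/file2.conf'
parent = os.path.dirname(path)                       # '/path/smth'
stem = os.path.splitext(os.path.basename(path))[0]   # 'file2'
```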
To do this in Bash:
find /path/ -type f -name "*.conf"
Note that if you want to do this in a Bash script, you can store /path/ in a variable, for instance one named directory, and change the command like so:
find "$directory" -type f -name "*.conf"
To do this in Python:
import os

PATH = '/path/'
test_files = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH)
              for f in filenames if os.path.splitext(f)[1] == '.conf']
There are some other ways to do this in Python listed here as well
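For example, pathlib (Python 3.4+) can do the same recursive search more concisely; a sketch:

```python
from pathlib import Path

def find_conf(root):
    # recursive search, analogous to: find root -name "*.conf"
    return sorted(str(p) for p in Path(root).rglob('*.conf'))
```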
bash parameter parsing is easy, fast, and lightweight.
for fp in /path/file1.conf /path/smth/file2.conf /path/smth/file3.conf; do
p="${fp%/*}" # % strips the pattern from the end (minimal, non-greedy)
f="${fp##*/}" # ## strips the pattern from the beginning (max-match, greedy)
f="${f%.*}" # end-strip the already path-cleaned filename to remove extention
echo "$p, $f"
done
/path, file1
/path/smth, file2
/path/smth, file3
To get what you apparently want as your formatting -
declare -A paths # associative array
while read -r fp; do
    p=${fp%/*} f=${fp##*/};           # preparse path and filename
    paths[$p]="${paths[$p]};${f%.*}"; # p as key, stacked/delimited val
done < file
Then stack/delimit your datasets.
for p in "${!paths[@]}"; do printf "%s|" "$p${paths[$p]}"; done; echo
/path;file1|/path/smth;file2;file3|
For each key, print key/val and a delimiter. echo at end for a newline.
If you don't want the trailing pipe, assign it all to one var in the second loop instead of printing it out, and trim the trailing pipe at the end.
$: for p in "${!paths[@]}"; do out="$out$p${paths[$p]}|"; done; echo "${out%|}"
/path;file1|/path/smth;file2;file3
Some folk will tell you not to use bash for anything this complex. Be aware that it can lead to ugly maintenance, especially if the people maintaining it behind you aren't bash experts and can't be bothered to go RTFM.
If you actually needed that embedded space in your example then your rules are inconsistent and you'll have to explain them.
If you have the file paths in a list, you can do this using a dictionary with the path as key and the filenames as value:
aa = ['/path/file1.conf', '/path/smth/file2.conf', '/path/smth/file3.conf']
f = {}
for x in aa:
    temp = x[:-len(".conf")].split("/")
    filename = temp[-1]
    path = "/".join(temp[:-1])
    if path in f:
        f[path] = f[path] + ";" + filename
    else:
        f[path] = filename
result = ""
for x in f:
    result = result + str(x) + ";" + f[x] + "|"
print(result)

Execute multiple *.dat files from subdirectories (bash, python)

I have the following:
I have directory with subdirectories which are filled with files. The structure is the following: /periodic_table/{Element}_lj_dat/lj_dat_sim.dat;
Each file consists of two rows (first one is the comment) and 12 columns of data.
What I would like to get is to go through all folders of elements (eg. Al, Cu etc.), open created file (for example named "mergedlj.dat" in periodic_table directory) and store all the data from each file in one adding Element name from parent directory as a first (or last) column of merged file.
The best way is to ignore the first row in each file and save only data from second row.
I am very inexperienced in bash/shell scripting, but I think this is the best way to go (Python is acceptable too!). Unfortunately I have only worked with files in the same folder as the script, so this is a new experience for me.
Here is the code just to find this files, but actually it doesn't do anything what I need:
find ../periodic_table/*_lj_dat/ -name lj_dat_sim.dat -print0 | while read -d $'\0' file; do
echo "Processing $file"
done
Any help will be highly appreciated!!
Here's a Python solution.
You can use glob() to get a list of the matching files and then iterate over them with fileinput.input(). fileinput.filename() lets you get the name of the file that is currently being processed, and this can be used to determine the current element whenever processing begins on a new file, as determined by fileinput.isfirstline().
The current element is added as the first column of the merge file. I've assumed that the field separator in the input files is a single space, but you can change that by altering ' '.join() below.
import re
import fileinput
from glob import glob

dir_prefix = '.'
glob_pattern = '{}/periodic_table/*_lj_dat/lj_dat_sim.dat'.format(dir_prefix)
element_pattern = re.compile(r'.*periodic_table/(.+)_lj_dat/lj_dat_sim.dat')
with open('mergedlj.dat', 'w') as outfile:
    element = ''
    for line in fileinput.input(glob(glob_pattern)):
        if fileinput.isfirstline():
            # extract the element name from the file name
            element = element_pattern.match(fileinput.filename()).groups()[0]
        else:
            print(' '.join([element, line]), end='', file=outfile)
You can use os.path.join() to construct the glob and element regex patterns, but I've omitted that above to avoid cluttering up the answer.
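For reference, that construction might look like this (directory names as in the question):

```python
import os.path

dir_prefix = '.'
glob_pattern = os.path.join(dir_prefix, 'periodic_table', '*_lj_dat', 'lj_dat_sim.dat')
```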

how can I cat two files with a matching string in the filename?

So I have a directory with ~162K files. Half of these files have the file name "uniquenumber.fasta" and the other half of the files have the file name "uniquenumber.fasta letters". For example:
12345.fasta
12345.fasta Somebacterialtaxaname
67890.fasta
67890.fasta Someotherbacterialtaxaname
...for another many thousand "pairs"
I would like to cat together the two files that share the unique fasta number. It does not matter the order of the concatenation (i.e. which contents comes first in the newly created combined file). I have tried some renditions of grep in the command line and a few lousy python scripts but I feel like this is more of a trivial problem than I am making it. Suggestions?
Here's a solution in Python (it will work unchanged with both Python 2 and 3). This assumes that each file XXXXX.fasta has one and only one matching XXXXX.fasta stringofstuff file.
import glob

fastafiles = sorted(glob.glob("*.fasta"))
for fastafile in fastafiles:
    number = fastafile.split(".")[0]
    space_file = glob.glob(number + ".fasta *")
    with open(fastafile, "a+") as fasta:
        with open(space_file[0], "r") as fasta_space:
            fasta.write("\n")
            fasta.writelines(fasta_space.readlines())
Here's how it works: first, the names of all *.fasta files are put into a list (I sort the list, but it's not strictly necessary). Next, the filename is split on . and the first part (the number in the filename) is stored. Then, we search for the matching XXXXX.fasta something file and, assuming there's only one of them, we open the .fasta file in append mode and the .fasta something file in read mode. We write a newline to the end of the .fasta file, then read in the contents of the "space file" and write them to the end of the .fasta file. Since we use the with context manager, we don't need to specifically close the files when we're done.
There's probably many ways to achieve this, but the first that came to my head would be to use the unix command find.
http://en.wikipedia.org/wiki/Find#Execute_an_action
The find command will print the filename that follows the pattern you specify. Using the -name and -exec flags, you can specify what characters should be in the file name, or run an additional command to filter the output.
If I were solving this problem, I would probably cycle over all files in the directory and run either a -name pattern or -exec pattern that would "find" the matching file. Then pipe the two file names to cat and redirect that output to a new file, hopefully concatenating the two. Hope that helps!

Batch (basename) file/folder renaming using an "index"

Renaming of files and folder in batch is a question often asked but after some search I think none is similar to mine.
Background: we send some biological samples to a service provider which returns files with unique names and a table in text format containing, amongst other information, the file name and the sample that originated it:
head samples.txt
fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz S1726_A_3_t A 3 t L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz S1731_A_GFP_c A GFP c L2354_A_GFP_c 163 5
L2377_Track-3893_R1.fastq.gz S1754_B_7_c B 7 c L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz S1739_B_GFP_t B GFP t L2362_B_GFP_t 163 6
The directory structure (for 34 directories):
L2369_Track-3885_
accepted_hits.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
L2349_Track-3865_
accepted_hits.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
Goal: because the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and the folders with the correspondent sample name, re-ordered in a more suitable manner. The result should look like:
7_t_B
7_t_B.bam
deletions.bed
junctions.bed
logs
7_t_B.bam.bai
insertions.bed
left_kept_reads.info
3_t_A
3_t_A.bam
deletions.bed
junctions.bed
logs
accepted_hits.bam.bai
insertions.bed
left_kept_reads.info
I've hacked together a solution with bash and python (newbie) but it feels over-engineered. The question is whether there is a simpler or more elegant way of doing it that I've missed. Solutions can be in Python, bash, or R; awk would also work since I am trying to learn it. Being a relative beginner does make one complicate things.
This is my solution:
A wrapper puts it all in place and gives an idea of the workflow:
#! /bin/bash
# select columns of interest and write them to a file - basenames
tail -n +2 samples.txt | cut -d$'\t' -f1,3 >> BAMfilames.txt
# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py
# finally do the renaming
./renameBam.sh
# and the folders too
./renameBamFolder.sh
renameBamFiles.py:
#! /usr/bin/env python
import re
# Read in the data sample file and create a bash file that will rename the tophat output.
# The renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
#
# Set the input file name
# (The program must be run from within the directory
# that contains this data file)
InFileName = 'BAMfilames.txt'

### Rename BAM files
# Open the input file for reading
InFile = open(InFileName, 'r')

# Open the output file for writing
OutFileName = 'renameBam.sh'
OutFile = open(OutFileName, 'a') # You can append instead with 'a'
OutFile.write("#! /bin/bash" + "\n")
OutFile.write(" " + "\n")

# Loop through each line in the file
for Line in InFile:
    # Remove the line ending characters
    Line = Line.strip('\n')
    # Separate the line into a list of its tab-delimited components
    ElementList = Line.split('\t')
    # Separate the folder string from the experimental name
    fileroot = ElementList[1]
    fileroot = fileroot.split()
    # Create variable names using a regex
    folderName = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName = folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command = "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)
    print command
    OutFile.write(command + "\n")

# After the loop is completed, close the files
InFile.close()
OutFile.close()

### Rename folders
# Open the input file for reading
InFile = open(InFileName, 'r')

# Open the output file for writing
OutFileName = 'renameBamFolder.sh'
OutFile = open(OutFileName, 'w')
OutFile.write("#! /bin/bash" + "\n")
OutFile.write(" " + "\n")

# Loop through each line in the file
for Line in InFile:
    # Remove the line ending characters
    Line = Line.strip('\n')
    # Separate the line into a list of its tab-delimited components
    ElementList = Line.split('\t')
    # Separate the folder string from the experimental name
    fileroot = ElementList[1]
    fileroot = fileroot.split()
    # Create variable names using a regex
    folderName = re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName = folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])
    command = "mv %s %s" % (folderName, fileName)
    print command
    OutFile.write(command + "\n")

# After the loop is completed, close the files
InFile.close()
OutFile.close()
RenameBam.sh - created by the previous python script:
#! /bin/bash
for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)
renameBamFolder.sh is very similar:
mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B
Since I am learning, I feel that some examples of different ways of doing this, and thinking about how to do it, will be very useful.
One simple way in bash:
find . -type d -print |
while IFS= read -r oldPath; do
    parent=$(dirname "$oldPath")
    old=$(basename "$oldPath")
    new=$(awk -v old="$old" '$1~"^"old{print $4"_"$5"_"$3}' samples.txt)
    if [ -n "$new" ]; then
        newPath="${parent}/${new}"
        echo mv "$oldPath" "$newPath"
        echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
    fi
done
Remove the "echo"s after initial testing to get it to actually do the "mv"s.
If all of your target directories are at one level as @triplee's answer implies, then it's even simpler. Just cd to their parent directory and do:
awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
    echo mv "$old" "$new"
    echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done
In one of your expected outputs you renamed the ".bai" file, in the other you didn't and you didn't say if you want to do that or not. If you want to rename it too just add
echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"
to whatever solution above you prefer.
Of course you can do it in Python alone, and it can yield a small, readable script.
First: read the samples.txt file and create a map from existing file prefixes to desired target prefixes. The file is not suited to Python's csv module, as the column separator also appears inside the last data column.
mapping = {}
with open("samples.txt") as samples:
    # throw away the header line
    samples.readline()
    for line in samples:
        # separate the columns, splitting on the whitespace occurrences
        # (either space sequences or tabs)
        fields = line.split()
        # skip blank or malformed lines:
        if len(fields) < 6:
            continue
        fq_file, sample_id, Sample_name, Library_ID, FC_Number, track_lanes_pos, *other = fields
        # the [:-2] part is to throw away the "R1" suffix, as in the example above
        file_prefix = fq_file.split(".")[0][:-2]
        target_id = "_".join((Library_ID, FC_Number, Sample_name))
        mapping[file_prefix] = target_id
Then check the directory names, and inside each one the ".bam" files for remapping:
import os

for entry in os.listdir("."):
    if entry in mapping:
        dir_prefix = "./" + entry + "/"
        for file_entry in os.listdir(dir_prefix):
            if ".bam" in file_entry:
                parts = file_entry.split(".bam")
                parts[0] = mapping[entry]
                new_name = ".bam".join(parts)
                os.rename(dir_prefix + file_entry, dir_prefix + new_name)
        os.rename(entry, mapping[entry])
Seems that you could simply read the required fields from the index file in a simple while loop. It's not obvious how the file is structured, so I am assuming that the file is whitespace-separated and that the Sample_Id is actually four fields (complex sample_id, then three components from the name). Maybe you have a tab-delimited file with internal spaces in the Sample_Id field? Anyhow, this should be easy to adapt if my assumptions are wrong.
# Skip the annoying field names
tail -n +2 samples.txt |
while read fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done
Take out the echos if the output looks like what you want.
Here's one way using a shell script. Run like:
script.sh /path/to/samples.txt /path/to/data
Contents of script.sh:
# add directory names to an array
while IFS= read -r -d '' dir; do
    dirs+=("$dir")
done < <(find "$2"/* -type d -print0)

# process the sample list
while IFS=$'\t' read -r -a list; do
    for i in "${dirs[@]}"; do
        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then
            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            # only change names if there's a bam file
            if [ -f "$i/accepted_hits.bam" ]; then
                mv "$i" "$new"
                mv "$new/accepted_hits.bam" "$new/$tag.bam"
            fi
        fi
    done
done < <(tail -n +2 "$1")
Although it's not exactly what you're looking for (just thinking outside the box): you might consider an alternate "view" of your file system, using the term "view" like a database view is to a table. You could do this via a "file system in user space", FUSE. One can do this with a number of existing utilities, but I don't know of one that just generically works with any set of files, specifically for renaming/re-organizing. But as a concrete example of how it can be used, pytagsfs creates a virtual (FUSE) file system based on rules you define, making a directory structure of files appear however you want. (Maybe this would work for you, too, but pytagsfs is actually intended for media files.) Then you just operate on that (virtual) file system, using whatever programs normally access that data. Or, to make the virtual directory structure permanent (if pytagsfs doesn't have an option to do this already), just copy the virtual file system into another directory (outside the virtual file system).

How to write tag deleter script in python

I want to implement a file reader (folders and subfolders) script which detects some tags and delete those tags from the files.
The files are .cpp, .h, .txt and .xml, and there are hundreds of files under the same folder.
I have no idea about python, but people told me that I can do it easily.
EXAMPLE:
My main folder is A: C:\A
Inside A, I have folders (B,C,D) and some files A.cpp A.h A.txt and A.xml. In B i have folders B1, B2,B3 and some of them have more subfolders, and files .cpp, .xml and .h....
xml files, contains some tags like <!-- $Mytag: some text$ -->
.h and .cpp files contains another kind of tags like //$TAG some text$
.txt has different format tags: #$This is my tag$
It always starts and ends with a $ symbol, but it is always preceded by a comment character (//, #, or <!--, depending on the file type).
The idea is to run one script and delete all tags from all files so the script must:
Read folders and subfolders
Open files and find tags
If they are there, delete and save files with changes
WHAT I HAVE:
import os
for root, dirs, files in os.walk(os.curdir):
    if files.endswith('.cpp'):
        %Find //$ and delete until next $
    if files.endswith('.h'):
        %Find //$ and delete until next $
    if files.endswith('.txt'):
        %Find #$ and delete until next $
    if files.endswith('.xml'):
        %Find <!-- $ and delete until next $ and -->
The general solution would be to:
use the os.walk() function to traverse the directory tree.
Iterate over the filenames and use fn_name.endswith('.cpp') with if/elseif to determine which file you're working with
Use the re module to create a regular expression you can use to determine if a line contains your tag
Open the target file and a temporary file (use the tempfile module). Iterate over the source file line by line and output the filtered lines to your tempfile.
If any lines were replaced, use os.unlink() plus os.rename() to replace your original file
It's a trivial exercise for a Python adept, but for someone new to the language it'll probably take a few hours to get working. You probably couldn't ask for a better task to get introduced to the language, though. Good luck!
----- Update -----
The files attribute returned by os.walk is a list so you'll need to iterate over it as well. Also, the files attribute will only contain the base name of the file. You'll need to use the root value in conjunction with os.path.join() to convert this to a full path name. Try doing just this:
for root, d, files in os.walk('.'):
    for base_filename in files:
        full_name = os.path.join(root, base_filename)
        if full_name.endswith('.h'):
            print full_name, 'is a header!'
        elif full_name.endswith('.cpp'):
            print full_name, 'is a C++ source file!'
If you're using Python 3, the print statements will need to be function calls but the general idea remains the same.
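Putting steps 3-5 together, here is a minimal sketch of the tempfile-based filtering for a single file. The tag regex is a simplified assumption for the C++-style tags only; a full version would pick the regex per extension:

```python
import os
import re
import tempfile

TAG_RE = re.compile(r'//\s*\$[^$]*\$')  # assumed C++-style tag: //$...$

def strip_tags(path, regex=TAG_RE):
    """Filter `path` line by line into a temp file, then replace the original."""
    changed = False
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with os.fdopen(fd, 'w') as tmp, open(path) as src:
        for line in src:
            filtered = regex.sub('', line)
            changed = changed or (filtered != line)
            tmp.write(filtered)
    if changed:
        os.unlink(path)
        os.rename(tmp_path, path)
    else:
        os.unlink(tmp_path)
    return changed
```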
Try something like this:
import os
import re

# Note: look-behinds must be fixed-width in Python's re, so the optional
# spaces after the comment marker are matched (and removed) with the tag.
CPP_TAG_RE = re.compile(r'(?<=//) *\$[^$]+\$')
tag_REs = {
    '.h': CPP_TAG_RE,
    '.cpp': CPP_TAG_RE,
    '.xml': re.compile(r'(?<=<!--) *\$[^$]+\$(?= *-->)'),
    '.txt': re.compile(r'(?<=#) *\$[^$]+\$'),
}

def process_file(filename, regex):
    # Set up.
    tempfilename = filename + '.tmp'
    infile = open(filename, 'r')
    outfile = open(tempfilename, 'w')
    # Filter the file.
    for line in infile:
        outfile.write(regex.sub("", line))
    # Clean up.
    infile.close()
    outfile.close()
    # Enable only one of the two following lines.
    os.rename(filename, filename + '.orig')
    #os.remove(filename)
    os.rename(tempfilename, filename)

def process_tree(starting_point=os.curdir):
    for root, d, files in os.walk(starting_point):
        for filename in files:
            # Get rid of `.lower()` in the following if case matters.
            ext = os.path.splitext(filename)[1].lower()
            if ext in tag_REs:
                process_file(os.path.join(root, filename), tag_REs[ext])
A nice thing about os.path.splitext is that it does the right thing for filenames that start with a dot.
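For example:

```python
import os.path

# a leading dot is treated as part of the name, not as an extension
assert os.path.splitext('.conf') == ('.conf', '')
assert os.path.splitext('file1.conf') == ('file1', '.conf')
```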
