File renaming; Can I get some feedback

Background: A friend of mine, who just might have some OCD issues, was telling me a story of how he was not looking forward to the hours of work he was about to invest into renaming tons of song files that had the words An, The, Of and many more capitalized.
Criteria: He gave me a list of words, omitted here because you will see them in the code, and told me that capitalization is O.K. if they are at the beginning of the song, but otherwise they must be lowercase.
Question 1: This is actually my first script and I am looking for some feedback. If there is a better way to write this, I would like to see it so I can improve my coding. The script is functional and does exactly what I would like it to do.
Question 2: Initially I did not have all 3 functions. I only had the function that replaced words. For some reason it would not work on files that looked like this "The Dark Side Of The Moon". When I ran the code, the "Of" would be replaced but neither of the "The"s would be. So, through trial and error I found that if I lowercase the first letter of the file, do my replace function and finally uppercase the file, it would work. Any clue as to why?
import os
words = ['A','An','The','Of','For','To','By','Or','Is','In','Out','If','Oh','And','On','At']
fileList = []
rootdir = ''
#Where are the files? Is the input a valid directory?
while True:
    rootdir = raw_input('Where is your itunes library? ')
    if os.path.isdir(rootdir): break
    print('That is not a valid directory. Try again.')
#Get a list of all the files in the directory/sub-directories
for root, subFolders, files in os.walk(rootdir):
    for file in files:
        fileList.append(os.path.join(root))
#Create a function that replaces words.
def rename(a,b,c):
    for file in os.listdir(c):
        if file.find(a):
            os.rename(file,file.replace(a,b))
#Create a function that changes the first letter in a filename to lowercase.
def renameL():
    for file in os.listdir(os.getcwd()):
        if file.find(' '):
            os.rename(file,file.replace(file,file[0].lower()+file[1:]))
#Create a function that changes the first letter in a filename to uppercase.
def renameU():
    for file in os.listdir(os.getcwd()):
        if file.find(' '):
            os.rename(file,file.replace(file,file[0].upper()+file[1:]))
#Change directory/lowercase the first letter of the filename/replace the offending word/uppercase the first letter of the filename.
for x in fileList:
    for y in words:
        os.chdir(x)
        renameL()
        rename(y,y.lower(),x)
        renameU()
Exit = raw_input('Press enter to exit.')
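A possible explanation for Question 2, offered as a sketch rather than a confirmed diagnosis of the exact run: str.find returns the index of the match, or -1 when there is no match, and the script uses that number directly as a condition. An index of 0 (a match at the very start of the name) is falsy, while -1 (no match at all) is truthy. The title below is the example from the question.

```python
name = "The Dark Side Of The Moon"

print(name.find("The"))   # 0: found at the start, but 0 counts as False
print(name.find("Of"))    # 14: found, and any non-zero index counts as True
print(name.find("Zed"))   # -1: not found, yet -1 also counts as True

# Lowercasing the first letter moves the first "The" away from index 0,
# which would explain why the lowercase/replace/uppercase workaround helps.
lowered = name[0].lower() + name[1:]
print(lowered.find("The"))  # 17

# The `in` operator expresses the intended "does it contain this?" test.
print("The" in name)  # True
```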

OK, some criticisms:
Don't prompt for arguments, get them from the command line. It makes testing, scripting, and a lot of other things much easier.
The implementation you've got wouldn't distinguish, e.g. "the" from "theater"
You're using the current-working-directory to pass around the directory you're working on. Don't do this, just use a variable.
Someone else said, "use set, it's faster". That advice is incorrect; the correct advice is "use set, because you need a set". A set is an unordered collection of unique items (a list is an ordered collection of not-necessarily-unique items). As a bonus for using the right collection, your program will probably run faster.
You need to properly split up the work you're trying to do. I'll explain:
Your program has two parts: (1) loop through all the files in some directory and rename them according to some rule; (2) the rule itself: given a string (yeah, it's going to be a file name, but forget about that), capitalize the first word and every subsequent word that isn't in some given set.
You've got (1) down pretty pat, so dig further into (2). The steps there are: a. knock everything down to lower-case; b. break the string into words; c. for each word, capitalize it if you're supposed to; d. join the words back into a string.
Write (2) and write a test program that calls it to make sure it works properly:
assert capitalizeSongName('the Phantom Of tHe OPERA') == 'The Phantom of the Opera'
When you're happy with (2), write (1) and the whole thing should work.
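Steps a through d could be sketched roughly like this. The function name comes from the assert above; SMALL_WORDS is a name made up for this sketch, filled with the word list from the question, and splitting on whitespace is an assumption about what counts as a word.

```python
# Words that stay lowercase unless they start the title (list from the question).
SMALL_WORDS = {w.lower() for w in [
    'A', 'An', 'The', 'Of', 'For', 'To', 'By', 'Or', 'Is', 'In',
    'Out', 'If', 'Oh', 'And', 'On', 'At',
]}

def capitalizeSongName(name):
    # a. lower-case everything, b. break the string into words
    words = name.lower().split()
    result = []
    for index, word in enumerate(words):
        # c. capitalize the first word and every word not in the set
        if index == 0 or word not in SMALL_WORDS:
            result.append(word.capitalize())
        else:
            result.append(word)
    # d. join the words back into a string
    return ' '.join(result)

assert capitalizeSongName('the Phantom Of tHe OPERA') == 'The Phantom of the Opera'
```

Note that this also lowercases a file extension and collapses runs of spaces, so a real file name would probably need its extension split off first.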

Repeated code is usually considered bad style (DRY is the buzzword). Also I usually try not to interleave functionality.
For the "design" of this little script I would first walk the directories and create a large list of all audio files and directories. Then I would write a function that handles changing one of the items in the list and create another list using map. Now you have a current list and a want list. Then I would zip those lists together and rename them all.
If your music library is really huge, you can use itertools, so you don't have large lists in memory but iterators (only one item in memory at once). This is really easy in Python 2: use imap instead of map and izip instead of zip. (In Python 3 the built-in map and zip are already lazy.)
To give you an impression and a few hints to useful functions, here is a rough sketch of how I would do it. (Warning: untested.)
import os
import sys
words = ['A','An','The','Of','For','To','By','Or','Is','In','Out','If','Oh','And','On','At']
wantWords = map(str.lower, words)
def main(args):
    rootdir = args[1]
    files = findFiles(rootdir)
    wantFiles = map(cleanFilename, files)
    rename(files, wantFiles)
def findFiles(rootdir):
    result = []
    for root, subFolders, files in os.walk(rootdir):
        for filename in files:
            result.append(os.path.join(root, filename))
    return result
def cleanFilename(filename):
    # do replacement magic
    return filename  # placeholder so the sketch runs; the real logic goes here
def rename(files, wantFiles):
    for source, target in zip(files, wantFiles):
        os.rename(source, target)
if __name__ == '__main__':
    main(sys.argv)
The advantage is that you can see in main() what is happening without looking into the details of the functions. Each function does one thing: one only walks the filesystem, one only changes one filename, and one actually renames files.
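For a very large library, the same design can be kept lazy with generators. This is only a sketch: find_files, clean_filename and plan_renames are hypothetical names, and clean_filename just lowercases the file name as a stand-in for the real replacement logic. Nothing is renamed here; the function only yields (current, wanted) pairs one at a time.

```python
import os

def find_files(rootdir):
    """Yield full file paths one at a time instead of building a list."""
    for root, _subfolders, files in os.walk(rootdir):
        for filename in files:
            yield os.path.join(root, filename)

def clean_filename(path):
    # Hypothetical stand-in for the replacement logic: lowercase the name only.
    directory, name = os.path.split(path)
    return os.path.join(directory, name.lower())

def plan_renames(rootdir):
    """Lazily pair each current path with the path we want it to have."""
    return ((source, clean_filename(source)) for source in find_files(rootdir))
```

Looping over plan_renames(rootdir) and printing each pair gives a dry run; calling os.rename(source, target) inside that loop would perform the renames.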

Use a set instead of a list. (It's faster)
I'm not really sure what you're trying to do there. The approach I took was just to lowercase the whole thing, then uppercase the first letter of each word as long as that word isn't in the set, and then uppercase the very first letter regardless (just in case it was one of those special words).
C# version I wrote a little while ago:
private static HashSet<string> _small = new HashSet<string>(new[] { "of", "the", "and", "on", "sur", "de", "des", "le", "la", "les", "par", "et", "en", "aux", "d", "l", "s" });
static string TitleCase(string str)
{
    if (string.IsNullOrEmpty(str)) return string.Empty;
    return string.Concat(char.ToUpper(str[0]),
        Regex.Replace(str, @"\w+", m =>
        {
            string lower = m.Value.ToLower();
            return _small.Contains(lower)
                ? lower
                : string.Concat(char.ToUpper(lower[0]), lower.Substring(1));
        })
        .Substring(1));
}
I used a regex instead of splitting on spaces because I had a lot of French words in there that were separated by apostrophes instead.
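The same regex-based idea might look like this in Python. It is a rough translation, not the original author's code: title_case and SMALL are hypothetical names, and the set is a trimmed version of the C# one.

```python
import re

# Words kept lowercase unless they start the string (trimmed from the C# set).
SMALL = {"of", "the", "and", "on", "de", "des", "le", "la", "les", "et", "en", "d", "l"}

def title_case(text):
    if not text:
        return ""

    def fix(match):
        word = match.group(0).lower()
        # Small words stay lowercase; everything else gets a capital first letter.
        return word if word in SMALL else word[0].upper() + word[1:]

    # \w+ matches each run of word characters, so apostrophes act as separators.
    result = re.sub(r"\w+", fix, text)
    # The very first character is uppercased regardless.
    return result[0].upper() + result[1:]
```

With this sketch, title_case("l'amour de la vie") gives "L'Amour de la Vie".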

Related

how do I add phrases to a file without deleting the file

I have a plain text (.txt) file. I keep separate files for phrases of different lengths (spaces count towards the phrase length). I say "phrases" because an entry can be multiple words, but in the example below I use three-letter entries that are each a single word. Each phrase is on its own line and is followed by a comma. Imagine you have a file like this:
app,
bar,
car,
eel,
get,
pod,
What I want is to be able to add one or more phrases that will be assumed to only contain lowercase alphabetical letters and/or spaces.
For example, let us say I want to add the phrases in this order:
(cat, bat, car, hat, mom, rat)
Basically, I want to add these phrases to the file without deleting the file, making sure no phrase repeats in the file and that the phrases stay alphabetically sorted. Spaces are assumed to sort after the letter z. So after inputting these phrases, the file should look like this:
'
app,
bar,
bat,
car,
eel,
get,
hat,
mom,
pod,
rat
'
Each file is assumed to grow to at least a gigabyte of data, so copying the file in order to accomplish this is a no-go. What is the fastest, least memory-consuming way to do it?
I haven't tried anything that 100% works. I know what to do, I just don't know how to do it. Here are the main points that I need to accomplish.
1) Make sure the phrase(s) are created (using input() function)
2) Open the file of organized words (using "with open(filename)" statements)
3) Put each phrase into the "correct" spot in the file. By "correct" I mean that is alphabetical and is not a repeat.
4) Make sure the file doesn't get deleted.
Here is what I have currently (changed it a bit and it is doing MORE of what I want, but not everything):
phrase_to_add = input('Please enter the phrase: ').lower()
with open('/Users/ian/Documents/three_character_phrases.txt') as file:
    unique_phrases = list(file.read().split())
    unique_phrases.append(phrase_to_add)
    unique_phrases.sort()
list_of_phrases = set()
for phrase in unique_phrases:
    list_of_phrases.add(phrase)
with open('/Users/ian/Documents/three_character_phrases.txt', 'w') as fin:
    for phrase in list_of_phrases:
        fin.write(phrase + '\n')
So I started with BOTH files being empty and I added the word 'cow' by putting it into the input, and this is what the file looked like:
three_character_phrases.txt:
cow
Then I inputted the word "bat" and I got this:
bat
cow
Then I added the word "bawk" (I know it isn't a three-letter word, but I'll take care of making sure the right words go into the right files) and I got this:
bawk
bat
cow

It looks like you're getting wrapped up in the implementation instead of trying to understand the concept, so let me invite you to take a step back with me.
You have a data structure that resembles a list (since order is relevant) but allows no duplicates.
['act', 'bar', 'dog']
You want to add an entry to that list
['act', 'bar', 'cat', 'dog']
and serialize the whole thing to file afterwards so you can use the same data between multiple sessions.
First up is to establish your serialization method. You've chosen a plain text file, line delimited. There's nothing wrong with that, but if you were looking for alternatives then a csv, a json, or indeed serializing directly to database might be good too. Let's proceed forward under the assumption that you won't change serialization schemas, though.
It's easy to read from file
from pathlib import Path
FILEPATH = Path("/Users/ian/Documents/three_character_phrases.txt")
def read_phrases():
    with FILEPATH.open(mode='r') as f:
        return [line.strip() for line in f]
and it's easy to write to it, too.
# Assume FILEPATH is defined here, and in all future snippets as well.
def write_phrases(phrases):
    with FILEPATH.open(mode='w') as f:
        f.writelines(f'{phrase}\n' for phrase in phrases)
        # this is equivalent to:
        # text = '\n'.join(phrases)
        # f.write(text + '\n')
You've even figured out how to have the user enter a new value, though your algorithm could use work to make the worst case better. Since you're always inserting into a sorted list, the bisect stdlib module can help your performance here for large lists. I'll leave that for a different question, though.
Since you've successfully done all the single steps, the only thing holding you back is to put them all together.
phrases = read_phrases()
phrase_to_add = input('Please enter the phrase: ').lower()
if phrase_to_add not in phrases:
    phrases.append(phrase_to_add)
    phrases.sort()  # this is, again, not optimal. Look at bisect!
write_phrases(phrases)
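As a rough illustration of the bisect hint, here is one way the membership check and the insert could be combined. add_phrase is a hypothetical helper, and it assumes the list is already sorted and that plain string ordering is acceptable (the "spaces sort after z" rule from the question is not handled here).

```python
import bisect

def add_phrase(phrases, phrase):
    """Insert phrase into the sorted list unless it is already present.

    Returns True if the list changed, False if the phrase was a duplicate.
    """
    # Binary search for the leftmost position where phrase could go.
    index = bisect.bisect_left(phrases, phrase)
    if index < len(phrases) and phrases[index] == phrase:
        return False
    phrases.insert(index, phrase)
    return True

phrases = ['app', 'bar', 'car', 'eel', 'get', 'pod']
for new_phrase in ['cat', 'bat', 'car', 'hat', 'mom', 'rat']:
    add_phrase(phrases, new_phrase)
print(phrases)
```

The search is O(log n); the list insert itself is still O(n), and the whole file is still rewritten afterwards, so this only removes the repeated sort and the linear membership test.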

Reg.sub regex help in Python to normalize directory/file to play nice with Windows

Very new here, and I am trying to modify some python code to normalize directory/file names for Windows using regular expression. I have searched and found lots of code examples, but haven’t quite figured out how to put it all together.
This is what I am trying to accomplish:
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? *
Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So, I need to get rid of ellipses without affecting the extension. To clarify, when I say ellipsis, I am referring to a pattern of three periods, and NOT the single unicode character “Horizontal Ellipsis (U+2026)”. I have researched and found multiple ways of doing individual parts of this, but I cannot seem to get it all together and playing nice.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
This cleans up the names, but not the pattern of two or more periods.
return unicode(re.sub(r'[<>:"/\\|?*.]', "", filename))
This cleans up the names, but also affects the file extension.
[^\w\-_\. ]
This also seemed to be a viable alternative. It is a bit more restrictive than necessary, but I did find it easy to just keep adding specific characters I wanted to ignore.
\.{2,}
This is the piece I can’t seem to get to integrate with any of these methods. I understand that this should match two or more “.”, but leave a single “.” alone. But there are some situations where I “might” be left with a period at the end of a Windows directory name, which won’t work.
.*[.](?!mp3$)[^.]*$
I searched and found this specific snippet, which looks promising to match/ignore a specific extension. In my case, I want .mp3 left alone. Maybe a different way to go about things. And I think it might eliminate a potential problem of having a period at the end of a directory name.
Thank you for your time!
Edit: Additional Information Added
def normalize_filename(self, filename):
    """Remove invalid characters from filename"""
    return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
def get_outfile(self):
    """Returns output filename based on song information"""
    destination_dir = os.path.join(self.normalize_filename(self.info["AlbumArtist"]),
                                   self.normalize_filename(self.info["Album"]))
    filename = u"{TrackNumber:02d} - {Title}.mp3".format(**self.info)
    return os.path.join(destination_dir, self.normalize_filename(filename))
This is the relevant code I am trying to modify. The full code basically pulls song artist, album, and track descriptions out of a sqlite database file. Then based on that information, it creates an artist directory, album directory, and a mp3 file.
However, because of Windows naming restrictions, those names need to be normalized/sanitized.
Ideally I would like this to be done with a single re.sub, if it can be done.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
If there is another/better way to make this code work, I am open to it. But with my limited understanding, adding more complexity was beyond me, so I was trying to work within the bounds of what I currently understand. I have done a lot of reading over the past few days, but can’t quite accomplish what I would like to do.
For example: “Ned’s Atomic Dustbin\ARE YOU NORMAL?\Not Sleeping Around” needs to become “C:\Ned’s Atomic Dustbin\ARE YOU NORMAL\Not Sleeping Around.mp3”.
Another: “Green Day\UNO... DOS... TRÉ!\F*** Time” needs to become “C:\Green Day\UNO DOS TRÉ\F Time.mp3”.
Another: “Incubus\A Crow Left Of The Murder...\Pistola” would become “C:\Incubus\A Crow Left Of The Murder\Pistola.mp3”.
Tricky example: “System Of A Down\B.Y.O.B.\B.Y.O.B.” to “C:\System Of A Down\BYOB\BYOB.mp3”. Windows wouldn’t care if it was B.Y.O.B, but the last period is what causes issues. So it would probably be best if the solution eliminated all “.”, except on the extension .mp3.

My answer is totally based on the text below (you typed, of course):
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? * Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So here we go (name stands for the file or directory name):
unicode(re.sub(r'(\<|\>|\:|\"|\/|\\|\||\?|\*)', '', name))
Explanation:
\<|\>|\:|\"|\/|\\|\||\?|\* <= matches all of your undesired chars
At this point you will have erased all of your undesired chars EXCEPT the spaces/dots at the end of the name.
For your file_name you can update its variable with
file_name = re.sub(r'( +)$', '', file_name)
( +)$ <= matches one or more spaces at the end of the string.
and you'll be done, because there are no more restrictions besides that the name can't contain any spaces at its end (remember we already removed the special chars).
For directories, however, the name can't end with either periods or spaces.
So the best way, in my opinion of course, is to implement a repeated procedure, one that stops only when:
dir_name == re.sub(r'( +|\.+)$', '', dir_name)
and dir_name keeps being updated with dir_name = re.sub(r'( +|\.+)$', '', dir_name) while the above statement is false.
Hope this helps you.
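Put together, the approach above might look something like the following. This is a sketch in Python 3 (where str is already unicode, so the unicode() call is dropped); the function names are made up for illustration, and a character class is used in place of the long alternation since the two match the same characters.

```python
import re

# The characters Windows rejects in names, as listed in the question.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')

def remove_invalid_chars(name):
    return INVALID_CHARS.sub('', name)

def sanitize_file_name(name):
    # Files: drop invalid characters, then any trailing spaces.
    return re.sub(r' +$', '', remove_invalid_chars(name))

def sanitize_dir_name(name):
    # Directories: drop invalid characters, then keep stripping trailing
    # spaces or periods until the name stops changing.
    name = remove_invalid_chars(name)
    while True:
        stripped = re.sub(r'( +|\.+)$', '', name)
        if stripped == name:
            return name
        name = stripped
```

Note that this only trims the end of the name: an ellipsis in the middle, as in "UNO... DOS... TRÉ!", is left untouched, so the question's requirement to remove every run of periods would need an extra substitution.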

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter, so if I just have 't' then I get a bunch of t's. This tells me that I am matching single letters; however, I want to match whole words and, even more, to preserve the wildcard structure.
What I would like to happen is that a user enters text (including what will be a wildcard) that I can substitute in place of 'th*', with the wildcard still doing what it should. That leads to the question: can I just stick in a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', for example, and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas"? Also, any idea what I have wrong in the second snippet, since it does not operate on the incoming text the way the first snippet (correctly) operates on the list?

If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split(' ') # split line on spaces to make a list
filtered = fnmatch.filter(file_contents, 'th*')
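To cover the "user input" part as well: fnmatch.filter accepts any string as the pattern, so a variable works in place of 'th*' and no call to fnmatch.translate is needed. A minimal sketch, where find_words is a hypothetical helper and splitting on any whitespace (rather than only single spaces) is an assumption about what counts as a word:

```python
import fnmatch

def find_words(text, pattern):
    """Return the words in text that match a shell-style wildcard pattern."""
    # split() with no argument splits on any run of whitespace, including newlines.
    words = text.lower().split()
    return fnmatch.filter(words, pattern.lower())

# The pattern could come straight from the user, e.g.:
#     pattern = input('Search pattern: ')
print(find_words("This is just a test thing", "th*"))
```

Punctuation stays attached to words with this approach, so "thing," would also match 'th*'.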

Putting parts of a text file into a list

I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "Item:" and before "Other:"
This can be easily done with small input, but the problem is that it may contain a large number of items and text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?

I can see no need for rfind() nor strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []
with open('file') as f:
    for line in f:  # iterating the file object reads one line at a time
        if line.startswith(start):
            data = line[len(start):]
            the_list.append(data)
            should_append = True
        elif line.startswith(end):
            should_append = False
            break
        elif should_append:
            the_list.append(line)
print(the_list)
This doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.

To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example in code:
pattern = "item:"
with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; split it along whitespace or a character, then put it into a list.
            pass
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.

Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in Python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open(r"c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.

This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python did not have a built-in enum before the enum module arrived in version 3.4. So I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys
class States:
    pass
States.looking_for_item = 1
States.collecting_input = 2
def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item! Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition! Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)
lst = get_list_from_file(sys.argv[1], "item:", "Other:")
# do something with lst; for now, just print
print(lst)
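On Python 3.6 and later, the standard enum module could stand in for the class-with-attributes workaround. The following is a condensed variant of the same state machine, not a drop-in replacement: collect_items is a hypothetical name, it takes any iterable of lines (an open file works), and unlike the function above it returns what it has collected even if the stop code never appears.

```python
from enum import Enum, auto

class State(Enum):
    LOOKING_FOR_ITEM = auto()
    COLLECTING_INPUT = auto()

def collect_items(lines, start_code, stop_code):
    items = []
    state = State.LOOKING_FOR_ITEM
    for line in lines:
        text = line.lstrip()
        if state is State.LOOKING_FOR_ITEM:
            if not text.startswith(start_code):
                continue  # discard the line; stay in the same state
            # Found the start code: switch state and chop it off the line.
            state = State.COLLECTING_INPUT
            text = text[len(start_code):]
        elif text.startswith(stop_code):
            break  # terminating condition reached
        # Split on commas and strip whitespace from each piece.
        items += [s.strip() for s in text.split(",")]
    return items
```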

I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
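The chunked approach could be sketched as follows. Several things here are assumptions rather than part of the description above: extract_between and the 64 KiB default chunk size are made up for illustration, the function handles only the first start/stop pair, and it keeps a short tail of each discarded chunk so that a start code split across two chunks is still found.

```python
def extract_between(path, start_code, stop_code, chunk_size=64 * 1024):
    """Return the text between start_code and stop_code, or None if no start."""
    buffer = ""
    collecting = False
    with open(path, "rt") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            buffer += chunk
            if not collecting:
                pos = buffer.find(start_code)
                if pos == -1:
                    # Discard the chunk, keeping only a tail that might hold
                    # the beginning of a start code straddling two chunks.
                    keep = len(start_code) - 1
                    buffer = buffer[-keep:] if keep > 0 else ""
                    continue
                # Start code found: keep only what follows it.
                buffer = buffer[pos + len(start_code):]
                collecting = True
            end = buffer.find(stop_code)
            if end != -1:
                return buffer[:end]
    # Stop code never appeared: return whatever was collected, if anything.
    return buffer if collecting else None
```

The returned string still contains commas and newlines, so it would need to be split afterwards. Searching the whole buffer for the stop code on every chunk is simple but repeats work; remembering where the previous search ended would avoid that.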

Spell check program in python

Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I didn't get solutions to the problem so can somebody tell me how I went and what the correct answer should be?:
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?

There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right

I don't know what you are trying to iterate over, but why don't you just first iterate over your words (which are in the variable a I guess?) and then for every word in a you iterate over the wordlist and check whether or not that word is in the wordslist.
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.

It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
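For readers who want something to compare their own attempt against, the set-difference idea might look roughly like this. It is a sketch, not the exercise's official solution: misspelled_words is a hypothetical name, and treating any run of letters or apostrophes as a word is an assumption.

```python
import re

def misspelled_words(text, word_list):
    """Return the unique words in text that are not in word_list."""
    known = {word.lower() for word in word_list}
    # Assumption: a "word" is any run of letters or apostrophes.
    found = set(re.findall(r"[a-z']+", text.lower()))
    # Set difference: everything found in the text that is not known.
    return found - known

print(misspelled_words("The cat saat on the mat", ["the", "cat", "sat", "on", "mat"]))
```

To run it on a file, the text could be read with open(r"C:\path\to\file.txt").read(), with the path passed as a string.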
