Sorting lists based on very specific criteria - python

I am looking to sort a list based on whether an item contains a ".". There will be 2 items in the list; I just need to make sure it is in the correct format.
My list needs to be in the format: [item1, item2.ext]
I need the item without the extension (The folder) to be first to properly use shutil.
I know I could do 2 for loops through my list, but that seems wasteful. I also know I could check whether I am pointing to a file or a folder, but I think it is easier to force the list to be in the correct order explicitly.
Here is my code:
# Sets up a list of files and a blank list of those to remove
files = os.listdir(currentDir)
print(files)
files_to_remove = []
print(files_to_remove)
# Loops through files looking for
for f in files: # I could break this into 2 for loops but that seems dumb
    if "." not in f: #checks if is a folder
        files_to_remove.append(f)
        print(files_to_remove)
    if f"lecture{lecturenum}.zip" in f: #Checks if is the zip file that was just unzipped
        files_to_remove.append(f)
        print(files_to_remove)
print(files_to_remove)
time.sleep(0.1) # This removes a synching error where windows is trying to delete it while it is still using it
shutil.rmtree(currentDir + files_to_remove[0]) # I could change this to check before deciding
os.remove(currentDir + files_to_remove[1])
Any help would be greatly appreciated
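One way to get the folder first without a second loop is to sort on whether the name contains a dot. This is only a minimal sketch: it assumes, as the question does, that folder names never contain "." and that file names always do. The helper name folder_first is hypothetical.

```python
def folder_first(names):
    """Return names with dot-free entries (assumed to be folders) first."""
    # sorted() is stable and False sorts before True, so names without a
    # "." move to the front while everything else keeps its relative order.
    return sorted(names, key=lambda name: "." in name)


files_to_remove = folder_first(["lecture3.zip", "lecture3"])
print(files_to_remove)  # ['lecture3', 'lecture3.zip']
```

If a folder name could ever contain a dot, checking os.path.isdir() on the full path is probably the more reliable test.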

Related

Python: Need help comparing two lists

I recently discovered a huge bug in my script and I need some help figuring it out.
This script searches through multiple sub-directories in our codebase, finds all source files that meet certain criteria, and makes a list of the names. I call this list: all_files
It then runs another program, and this program makes new source files (close to the same names) that get placed in another directory. I go into that directory and make a list of the file names in there. I call this list: covered_files
Now here's the problem: the new source file names differ in that they have leading words. For example:
--> file name in all_files: fileName.cpp
--> file name in covered_files: prefix_fileName.cpp
These two files correspond to each other, and the job of this script is to return a list of names from all_files that show up in covered_files... That is, in the above example the name "fileName.cpp" would be added to the list because it is in both.
One main issue is that there are names that would correspond to "fileName1.cpp", "fileName2.cpp", and so on. And as you'll see in my code below, it only accounts for one of these files being covered when they both need to be.
What I currently have:
def find_covered_files(all_files):
    covered_path = <path to the newly generated files>
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    file_match_list = []
    # Search through the covered files for matches
    for cov_file in covered_files:
        # Find the file matches
        for files in all_files:
            if files in cov_file:
                file_match_list.append(files)
    return file_match_list
So overall my question is: Is there a way to search through the two lists, where one of the entries is a substring of the other, that will give me each file name that is covered regardless if that substring appears more than once? Thank you in advance for any help!
Edit: Another example using some actual file names:
There are files Segment.cpp, SegmentAllocation.cpp, and SegmentImpl.cpp. These would be in the all_files list with the matching prefixed names in the covered_files list.
After the code is run, I would expect all three of them to be in the file_match_list. However, only the first file name is in there, repeated 3 times.
So instead of: (desired)
['Segment.cpp', 'SegmentAllocation.cpp', 'SegmentImpl.cpp']
I get:
['Segment.cpp', 'Segment.cpp', 'Segment.cpp']
I answered my own question after debugging. I needed a break statement in the inner for loop:
def find_covered_files(all_files):
    covered_path = <path to the newly generated files>
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    file_match_list = []
    # Search through the covered files for matches
    for cov_file in covered_files:
        # Find the file matches
        for files in all_files:
            if files in cov_file:
                file_match_list.append(files)
                break
    return file_match_list
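The substring test (files in cov_file) can also match a short name that happens to sit inside a longer one. A stricter check might look like the sketch below. It assumes every generated name is either identical to the original or the original preceded by some prefix and an underscore, as in prefix_fileName.cpp. The names match_covered, all_files and covered_files as parameters are hypothetical.

```python
def match_covered(all_files, covered_files):
    """Return each name in all_files that has a prefixed twin in covered_files."""
    matches = []
    for name in all_files:
        # Assumption: a covered file is "<prefix>_<name>" or exactly "<name>".
        if any(cov == name or cov.endswith("_" + name) for cov in covered_files):
            matches.append(name)  # appended at most once per original name
    return matches


all_files = ["Segment.cpp", "SegmentAllocation.cpp", "SegmentImpl.cpp"]
covered_files = [
    "prefix_Segment.cpp",
    "prefix_SegmentAllocation.cpp",
    "prefix_SegmentImpl.cpp",
]
print(match_covered(all_files, covered_files))
```

Looping over all_files on the outside means each original name can appear in the result only once, however many covered files match it.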

Is there a python module to create filetree from a list or index of paths/files (not local system)

Working on a school project... I have a python list object (obtained from a text file) that contains an entire directory listing (about 400K items). Is there a module to organize this list or text file into a file tree structure automatically?
For example, the list starts with the root "/", followed by the first folder in it, all the way out to the last file in the path "/folder1/sub-folder_last/lastfile.txt".
This goes all the way to the very last item "/last_folder_in_root" out to the very last sub folder in that "/last_folder_in_root/last_sub_folder/last_file.txt"
I have been searching for a good start point but the ambiguity in searching gets me nothing but os, os walk items. Hoping there is already something out there that will run through this and separate sub items with a tab or something similar.
The end output would be something similar to:
/
    /first_folder
        /first_sub_folder
            /file.txt
    /second_folder
    /last_folder
        /last_sub_folder
            /last_file.txt
I searched through several libraries but was unable to find one that supported this. This does not involve os.walk, it's not for the local file system. It's from a txt file, or list.
Basically trying to find something similar to the os.walk output, but bring the information in from a list or file, rather than the local system. Any ideas or direction for this?
You can solve this with some logic:
with open('filename.txt') as in_file:
    for line in in_file.readlines():
        as_list = line.split('/')
        # special case for the root
        if len(as_list) == 2 and as_list[0] == '' and as_list[-1] == '\n':
            indent = 0
        else:
            indent = (len(as_list) - 1) * 4 + len(as_list[-1]) + 1
        output = '/{}'.format(as_list[-1].strip('\n'))
        print(output.rjust(indent))
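If an actual tree structure is wanted rather than only indented output, nested dictionaries are one option. The sketch below assumes the paths are "/"-separated strings, one per list item, and that a 4-space indent per level is acceptable. The function names build_tree and render are hypothetical, not from any library.

```python
def build_tree(paths):
    """Fold '/'-separated paths into nested dicts keyed by path component."""
    tree = {}
    for path in paths:
        node = tree
        # strip() drops the trailing newline, strip("/") the outer slashes
        for part in path.strip().strip("/").split("/"):
            if part:  # the bare root "/" yields an empty part; skip it
                node = node.setdefault(part, {})
    return tree


def render(tree, depth=1, indent="    "):
    """Return one indented line per entry, children listed under parents."""
    lines = []
    for name, children in tree.items():
        lines.append(indent * depth + "/" + name)
        lines.extend(render(children, depth + 1, indent))
    return lines


paths = ["/", "/first_folder", "/first_folder/first_sub_folder/file.txt", "/second_folder"]
print("/")
print("\n".join(render(build_tree(paths))))
```

Because the tree is built from the components themselves, the input does not need to be sorted, and a parent folder does not need its own line in the listing.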

Find and remove duplicate files using Python

I have several folders which contain duplicate files that have slightly different names (e.g. file_abc.jpg, file_abc(1).jpg), that is, a suffix of "(1)" on the end. I am trying to develop a relatively simple method to search through a folder, identify duplicates, and then delete them. The criterion for a duplicate is "(1)" at the end of the file name, so long as the original also exists.
I can identify duplicates okay; however, I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", however using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at the answers to "Finding duplicate files and removing them"; however, this seems to be far more sophisticated than what I need.
If there are better (and simpler) ways to do this then please let me know; however, I only have around 10,000 files in total in 50-odd folders, so not a great deal of data to crunch through.
My code so far is:
import os
file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print (file_list)
for file in file_list:
    if ("(1)" in file):
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: "+str(file_list.index(file)))
        file_remove = ('r"%s' %file_path+"'\'"+file+'"')
        print ("The text string is: " + file_remove)
        os.remove(file_remove)
Your code is just a little more complex than necessary, and you didn't apply a proper way to create a file path out of a path and a file name. And I think you should not remove files which have no original (i. e. which aren't duplicates though their name looks like it).
Try this:
for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))
Mind though, that this doesn't work properly for files which have multiple occurrences of (1) in them, and files with (2) or higher numbers also aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
Of course you should check the contents of these few files then to be sure that not just two of them are accidentally the same size without being identical. If you are sure you have a group of identical ones, remove all but the one with the simplest names (e. g. without suffixes (1) etc.).
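The steps above could be sketched roughly as follows. Two substitutions are made here and should be read as assumptions: files are grouped by size in a dictionary instead of being sorted and scanned for neighbours, and the content check uses a SHA-256 digest instead of a byte-by-byte comparison. The function name find_duplicate_groups is hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_groups(root_dir_path):
    """Return lists of paths under root_dir_path whose contents are identical."""
    # Step 1: bucket every file in the tree by its size.
    by_size = defaultdict(list)
    for dir_path, _dir_names, file_names in os.walk(root_dir_path):
        for file_name in file_names:
            full_path = os.path.join(dir_path, file_name)
            by_size[os.path.getsize(full_path)].append(full_path)

    # Step 2: within each size bucket, confirm equality by hashing contents.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_digest[digest].append(path)
        groups.extend(sorted(group) for group in by_digest.values() if len(group) > 1)
    return groups
```

The sketch only reports the groups. Deciding which member of each group to keep (for example the one without a "(1)" suffix) and calling os.remove() on the rest is left to the caller, which makes it easier to review the list before anything is deleted.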
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).

Move Certain Files from One Directory to Another - Python

All,
I need to move files from one directory to another, but I don't want to move all the files in that directory, just the text files that begin with 'pws'. A list of all the files in the directory is:
['pws1.txt', 'pws2.txt', 'pws3.txt', 'pws4.txt', 'pws5.txt', 'x.txt', 'y.txt']
As stated, I want to move the 'pws*' files to another directory but not the x and y text files. What I want to do is remove all elements from the list that do not begin with 'pws'. My code is below:
loc = 'C:\Test1'
dir = os.listdir(loc)
#print dir
for i in dir:
    #print i
    x = 'pws*'
    if i != x:
        dir.remove(i)
print dir
The output does not keep what I want. Instead, it removes the x text file from the list and the even-numbered ones but retains the y text file.
What am I doing wrong? How can I make a list of only the files that start with 'pws' and remove the text files that do not begin with 'pws'?
Keep in mind I might have a list that has 1000 elements and several hundreds of those elements will start with 'pws' while those that don't begin with it, couple of hundreds, will need to be removed.
Everyone's help is much appreciated.
You can use list-comprehension to re-create the list as the following:
dir = [i for i in dir if i.startswith('pws')]
or better yet, define that at start:
loc = 'C:\\Test1'
dir = [i for i in os.listdir(loc) if i.startswith('pws')]
print dir
Explanation:
When you use x = 'pws*' and then check if i != x, you are comparing the element i against the literal string 'pws*'; the * is not treated as a wildcard, so no file name ever equals it. A better way is to use the built-in str.startswith() method, which checks whether the string starts with the provided substring. So in your loop you can use if i.startswith('pws'), or you can use a list comprehension as I mentioned above, which is a more pythonic approach.
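There is a second problem in the original loop: it removes items from the list it is iterating over, so the loop skips the element that slides into each freed slot. The small sketch below reproduces that with the file names from the question, taken in the order shown there (os.listdir may return a different order). The function names are hypothetical.

```python
names = ['pws1.txt', 'pws2.txt', 'pws3.txt', 'pws4.txt', 'pws5.txt', 'x.txt', 'y.txt']


def remove_while_iterating(items):
    """Mimic the question's loop on a copy of the list."""
    items = list(items)
    for item in items:
        if item != 'pws*':      # always true: no name equals the literal 'pws*'
            items.remove(item)  # shifts the rest left, so the next one is skipped
    return items


def keep_prefix(items, prefix='pws'):
    """Build a new list instead of mutating the one being looped over."""
    return [item for item in items if item.startswith(prefix)]


print(remove_while_iterating(names))  # ['pws2.txt', 'pws4.txt', 'x.txt']
print(keep_prefix(names))
```

Every other element survives the first function purely because of the skipping, not because of its name, which is why building a new list is the safer pattern.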
To move a file:
How to move a file in Python
You can use os.rename(origin,destination)
origin = r'C:\users\JohnDoe\Desktop'
destination = r'C:\users\JohnDoe\Desktop\Test'
startswith_ = 'pws'
And then go ahead and do a list comprehension
# Move files
[os.rename(os.path.join(origin,i),os.path.join(destination,i)) for i in os.listdir(origin) if i.startswith(startswith_)]
Just use a glob. This gives you a list of files in a directory without all the listdir calls and substring matching:
import glob,os
for f in glob.glob('/tmp/foo/pws*'):
    os.rename(f, '/tmp/bar/%s'%(os.path.basename(f)))
Edit 1: Here's your original code, simplified with a glob, and a new variable loc2 defined to be the place to move them to:
import glob
import os
# raw strings keep the backslashes in the Windows paths literal
loc = r'C:\Test1'
loc2 = r'C:\Test2'
files = glob.glob(os.path.join(loc, 'pws*'))
for i in files:
    os.rename(i, os.path.join(loc2, os.path.basename(i)))
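Note that os.rename can fail when the source and destination are on different drives or file systems; shutil.move falls back to copy-and-delete in that case. A sketch of the same glob approach wrapped in a function follows. The name move_matching and its parameters are hypothetical, and creating the destination directory when it is missing is an assumption about what is wanted.

```python
import glob
import os
import shutil


def move_matching(src_dir, dst_dir, pattern="pws*"):
    """Move files in src_dir whose names match pattern into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)  # assumption: create the target if needed
    moved = []
    for path in glob.glob(os.path.join(src_dir, pattern)):
        if os.path.isfile(path):  # leave matching sub-directories alone
            target = os.path.join(dst_dir, os.path.basename(path))
            shutil.move(path, target)
            moved.append(target)
    return sorted(moved)
```

A call such as move_matching(r'C:\Test1', r'C:\Test2') would then return the new paths of the files that were moved.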

Merging Two Lists Where one is a Sublist Created From Another List

True to the adage that weeks of programming will save you hours of work I started writing a little .py to backup my project files by going through all the network drives, locating a root project folder, having a look inside to see if there is a settings folder and if there is, to back it up off site.
I am firstly creating a list of the folders I am interested in:
Ant = "I:"
antD = os.listdir(Ant+'\\')
# antD is the list of all stuff on ANT's D:\
antDpath = (Ant+'\\')
namingRegex = re.compile( r"\d\d\d\d_\w[^.]*$")
# the above regex returns all folders that begin with four digits
# and a _ but do not contain a . so as not to include files
toCopyAnt = filter(namingRegex.match, antD)
#List of folders to look in
So there I have a nice list of folders I am interested in in this form:
>>>toCopyAnt
['9001_Project44_IDSF', '5015_Project0_517', '8012_Project_whatever']
Next up I go on and make this mess:
inAnt = []
for i in range(len(toCopyAnt)):
    dirAnt = (antDpath+toCopyAnt[i])
    inAnt.append(dirAnt)
The part above merely builds the full path (with the '\') for each folder so that os.listdir below scans the right directories.
antList = []
for i in range(len(toCopyAnt)):
    antL = os.listdir(inAnt[i])
    antList.append(antL)
This part here goes through each of the directories I have narrowed down and lists the contents.
The list antList looks like this:
>>>antList
[['Data', 'MagicStuff', 'Info', 'settings'], ['Data', 'processing', 'Settings', 'modularTakeOff'], ['Blue_CCheese', 'Data', 'Rubbishbin', 'songBird', 'images', 'Settings', 'Cakes'], ['canDelete', 'Backup']]
It is a list of lists... ugh.
My aim is to join these two fellas together such that I have another list which gives me the full path names for each subdirectory, like so in the case of toCopyAnt[2] and antList[2]:
['I:\8012_Project_whatever\canDelete','I:\8012_Project_whatever\Backup']
With the final goal being to use this new list, remove strings I don't need and then pass that to tree_copy to make my life better in every way.
I figure there is likely a far more efficient way to do this and I would love to know, but at the same time I would also like to tackle this issue as I have outlined it, as it would increase my yield of lists greatly. I've only been at this Python stuff for a few days now, so I haven't got all the tools out of the box just yet.
If what I am doing here is as painful and laboured as I suspect it is please point me in the right direction and I'll start over.
Ngiyabonga!
Ingwe
###########NEXT DAY EDIT###########
So I slept on this and have been attacking the logic again. Still convinced that this is messy and clumsy but I must soldier on.
Below is the first part of what I had working last night with some different comments:
inant = []
for i in range(len(toCopyAnt)):
    dirant = (Ant+'\\'+toCopyAnt[i]) ##makes full path of higher folder
    inant.append(dirant)
antList = []
for i in range(len(toCopyAnt)):
    antL = os.listdir(inant[i]) ##provides names of all subfolders
    antList.append(antL)
for i in range(len(antList)):
    antList[i] = filter(settingRegex.match, antList[i]) ##filters out only inpho project folders
## This mess here next builds a list of full path folders to backup.
preAntpath = []
postAntpath = []
for i in range(len(inant)):
    try:
        preAntpath=("%s\\%s\\"%(inant[i],antList[i][0]))
        postAntpath.append(preAntpath)
        preAntpath=("%s\\%s\\"%(inant[i],antList[i][1]))
        postAntpath.append(preAntpath)
        preAntpath=("%s\\%s\\"%(inant[i],antList[i][2]))
        postAntpath.append(preAntpath)
        preAntpath=("%s\\%s\\"%(inant[i],antList[i][3]))
        postAntpath.append(preAntpath)
    except:
        pass
OK, so the reason there are 4 iterations is just to make sure that I include the backup copies of the folders I want to keep. Sometimes there is a 'settings' and a 'settings - copy'. I want both.
##
##
## some debuggering next...
for i in range(len(postAntpath)):
    print postAntpath[i]
It works. I get a new list with merged data from the two lists.
So where inant[x] and antList[x] match (because the two lists are always the same length), inant[x] will merge with antList[x][0-3].
If, as is often the case, there is no settings folder in the project folder the try: seems to take care of it. Although I will need to test some situations where the project folder is 1st or 2nd in the list. In my current case it is last. I don't know if try: tries each iteration or stops when it hits the first error...
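On the try: question: a try block stops at the first statement that raises, and whatever was appended before that point stays in the list. Because the try: sits inside the for loop, the next project folder gets a fresh attempt, so the position of the short list should not matter. The sketch below shows this with a hypothetical helper, collect_first_four, which catches only IndexError instead of using a bare except.

```python
def collect_first_four(parent, children):
    """Join parent with up to four children, stopping quietly when they run out."""
    collected = []
    try:
        for index in range(4):
            # Raises IndexError as soon as index goes past the end of children;
            # nothing after the failing statement in the try block runs.
            collected.append("%s\\%s\\" % (parent, children[index]))
    except IndexError:
        pass  # paths appended before the error are kept
    return collected


for parent, children in [("I:\\9001", ["settings", "settings - copy"]), ("I:\\5015", [])]:
    print(collected := collect_first_four(parent, children))
```

Catching the specific exception keeps unrelated errors (a typo in a variable name, for example) from being silently swallowed. The walrus operator in the demo loop needs Python 3.8 or later.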
So that is how I made my two lists (one of them a list of lists) into a single usable list.
Next up is to try and figure out how to copy the settings folder as is and not just all of its contents. Copytree is driving me mad.
Again, any comments or advice would be welcome.
Ingwe.
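For what it is worth, pairing each parent folder with its own list of sub-folders is the kind of job zip() is suited to. The sketch below is written for Python 3 (the code above is Python 2, where filter returns a list) and assumes the two lists line up index by index. The name merge_paths is hypothetical.

```python
import os


def merge_paths(parents, children_lists):
    """Pair each parent with its own children and join them into full paths."""
    return [
        os.path.join(parent, child)
        for parent, children in zip(parents, children_lists)  # walk both lists in step
        for child in children  # an empty child list simply contributes nothing
    ]


parents = ["9001_Project44_IDSF", "5015_Project0_517"]
children_lists = [["settings", "settings - copy"], []]
print(merge_paths(parents, children_lists))
```

An empty inner list needs no try/except here, and any number of matching sub-folders is handled rather than exactly four. As for copying the settings folder itself and not just its contents, one approach that may help is to include the folder name in the destination given to shutil.copytree, for example os.path.join(backup_root, os.path.basename(source)), where backup_root and source are placeholder names.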
