Python: Need help comparing two lists

I recently discovered a huge bug in my script and I need some help figuring it out.
This script searches through multiple sub-directories in our codebase, finds all source files that meet certain criteria, and makes a list of their names. I call this list: all_files
It then runs another program, which generates new source files (with nearly the same names) and places them in another directory. I go into that directory and make a list of the file names there. I call this list: covered_files
Now here's the problem: the new source file names differ in that they have a leading prefix. For example:
--> file name in all_files: fileName.cpp
--> file name in covered_files: prefix_fileName.cpp
These two files correspond to each other, and the job of this script is to return a list of names from all_files that show up in covered_files... That is, in the above example the name "fileName.cpp" would be added to the list because it is in both.
One main issue is that there are names that would correspond to "fileName1.cpp", "fileName2.cpp", and so on. And as you'll see in my code below, it only accounts for one of these files being covered when they both need to be.
What I currently have:
from os import listdir
from os.path import isfile, join

def find_covered_files(all_files):
    covered_path = <path to the newly generated files>
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    file_match_list = []
    # Search through the covered files for matches
    for cov_file in covered_files:
        # Find the file matches
        for files in all_files:
            if files in cov_file:
                file_match_list.append(files)
    return file_match_list
So overall my question is: Is there a way to search through the two lists, where one entry is a substring of the other, that will give me each covered file name regardless of whether that substring appears more than once? Thank you in advance for any help!
Edit: Another example using some actual file names:
There are files Segment.cpp, SegmentAllocation.cpp, and SegmentImpl.cpp. These would be in the all_files list, with the matching prefixed names in the covered_files list.
After the code is run, I would expect all three of them to be in the file_match_list. However, only the first file name is in there, repeated three times.
So instead of: (desired)
['Segment.cpp', 'SegmentAllocation.cpp', 'SegmentImpl.cpp']
I get:
['Segment.cpp', 'Segment.cpp', 'Segment.cpp']

I answered the question myself after debugging: I needed a break statement in the inner for loop:
from os import listdir
from os.path import isfile, join

def find_covered_files(all_files):
    covered_path = <path to the newly generated files>
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    file_match_list = []
    # Search through the covered files for matches
    for cov_file in covered_files:
        # Find the file matches
        for files in all_files:
            if files in cov_file:
                file_match_list.append(files)
                break
    return file_match_list
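
For reference, the same matching can be written more directly by iterating over all_files and letting any() do the short-circuiting, so each name appears at most once in the result. A minimal sketch (covered_path is made a parameter here, since the real path is elided above):

from os import listdir
from os.path import isfile, join

def find_covered_files(all_files, covered_path):
    # Make a list of every file in the covered folder
    covered_files = [f for f in listdir(covered_path) if isfile(join(covered_path, f))]
    # Keep each name from all_files that appears as a substring of at
    # least one covered file name; any() stops at the first hit, like
    # the break above.
    return [name for name in all_files
            if any(name in cov_file for cov_file in covered_files)]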

Related

Sorting lists based on very specific criteria

I am looking to sort a list based on whether an item contains a "."; there will be 2 items in the list, and I just need to make sure it is in the correct format.
My list needs to be in the format: [item1, item2.ext]
I need the item without the extension (the folder) to be first to properly use shutil.
I know I could do 2 for loops through my list, but that seems wasteful, and I know I could check whether I am pointing to a file or a folder, but I think it is easier to force the list into the correct order explicitly.
Here is my code:
import os
import shutil
import time

# Sets up a list of files and a blank list of those to remove
files = os.listdir(currentDir)
print(files)
files_to_remove = []
print(files_to_remove)
# Loops through files looking for entries to remove
for f in files:  # I could break this into 2 for loops but that seems dumb
    if "." not in f:  # checks if it is a folder
        files_to_remove.append(f)
        print(files_to_remove)
    if f"lecture{lecturenum}.zip" in f:  # checks if it is the zip file that was just unzipped
        files_to_remove.append(f)
        print(files_to_remove)
print(files_to_remove)
time.sleep(0.1)  # This avoids a syncing error where Windows tries to delete the file while still using it
shutil.rmtree(currentDir + files_to_remove[0])  # I could change this to check before deciding
os.remove(currentDir + files_to_remove[1])
Any help would be greatly appreciated
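
Since the question already mentions the option of checking file vs. folder, one way to sidestep the ordering entirely is to decide per entry with os.path.isdir. A minimal sketch, assuming currentDir and files_to_remove are defined as in the code above:

import os
import shutil

for name in files_to_remove:
    full_path = os.path.join(currentDir, name)
    if os.path.isdir(full_path):
        shutil.rmtree(full_path)  # directories need rmtree
    else:
        os.remove(full_path)      # plain files, e.g. the unzipped .zip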

Operating within the file list from sub-directories in Python

I have been trying to operate on individual files contained in a list of files. Each file contains two-column data which I want to integrate to get a value. The main folder (main directory) contains multiple subfolders (sub-directories) named after each day. All files are in '.csv' format, so I don't have to worry about selecting specific file formats. I am trying to read the contents of each file within the sub-directories with the following code.
file_path = r"C:\Users\......\Totalfiles"
read_files = glob.glob(os.path.join(file_path,"*.csv"))
np_array_values = []
for (root, dirs, files) in os.walk(file_path):
for filenames in files:
values = pd.read_csv(filenames, encoding='utf-8', header=0)
np_array_values.append(values)
print(filenames)
There are multiple errors here.
If I skip the second for loop, only a list containing 8 files gets printed, as follows:
2019-10-12-18-28.csv
2019-10-12-18-28.csv
2019-10-12-18-28.csv
2019-10-12-18-28.csv
2019-10-12-18-28.csv
2019-10-12-18-28.csv
2019-10-12-18-28.csv
2019-10-12-18-28.csv
But if the for loop is used, then the total list of 100+ files is generated, and an error is raised from the pd.read_csv line when I try to assign the variable values; this error is displayed:
File b'2019-08-10_05-58.csv' does not exist: b'2019-08-10_05-58.csv'
This program runs totally fine for a single directory if I don't use the outer for loop. I couldn't find examples of this kind of problem (probably due to incorrect search keywords). I assume this would be useful for everyone working with loads of measurement files. Help on this would be much appreciated.
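
For what it's worth, the error message hints at the cause: os.walk() yields bare file names, so pd.read_csv() looks for them in the current working directory rather than in the sub-directory they actually live in. A minimal sketch of the fix, joining each name with the root directory it came from:

import os
import pandas as pd

file_path = r"C:\Users\......\Totalfiles"  # path elided as in the question
np_array_values = []
for root, dirs, files in os.walk(file_path):
    for filename in files:
        if not filename.endswith('.csv'):
            continue  # skip anything that is not a data file
        full_path = os.path.join(root, filename)  # bare name -> full path
        values = pd.read_csv(full_path, encoding='utf-8', header=0)
        np_array_values.append(values)
        print(full_path)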

Find and remove duplicate files using Python

I have several folders which contain duplicate files whose names differ only slightly (e.g. file_abc.jpg and file_abc(1).jpg), i.e. a "(1)" suffix on the end. I am trying to develop a relatively simple method to search through a folder, identify duplicates, and then delete them. The criterion for a duplicate is "(1)" at the end of the file name, so long as the original also exists.
I can identify duplicates okay; however, I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", however using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at existing answers (e.g. Finding duplicate files and removing them), however those seem far more sophisticated than what I need.
If there are better (and simpler) ways to do this then let me know; however, I only have around 10,000 files in total in 50-odd folders, so not a great deal of data to crunch through.
My code so far is:
import os

file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print(file_list)
for file in file_list:
    if "(1)" in file:
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: " + str(file_list.index(file)))
        file_remove = ('r"%s' % file_path + "'\'" + file + '"')
        print("The text string is: " + file_remove)
        os.remove(file_remove)
Your code is just a little more complex than necessary, and you didn't apply a proper way to create a file path out of a directory path and a file name. Also, I think you should not remove files which have no original (i.e. files whose names merely look like duplicates).
Try this:
for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))
Mind, though, that this doesn't work properly for files which have multiple occurrences of (1) in their names, and files with (2) or higher numbers aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list), and
yield each such double-group (i.e. a small list of files, typically just two, which are identical).
Of course you should then check the contents of these few files to be sure that two of them aren't accidentally the same size without being identical. Once you are sure you have a group of identical files, remove all but the one with the simplest name (e.g. without suffixes like (1)).
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).
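
A minimal sketch of that size-based approach, purely as an illustration: it hashes equal-sized files to confirm they really are identical, and uses the root_dir_path name suggested above.

import hashlib
import os
from itertools import groupby

def find_duplicate_groups(root_dir_path):
    """Yield groups of paths whose files have identical contents."""
    # Collect (size, path) pairs for every file in the tree.
    sized_files = []
    for dirpath, _dirnames, filenames in os.walk(root_dir_path):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sized_files.append((os.path.getsize(path), path))
    sized_files.sort()  # files of equal size become neighbours
    for _size, group in groupby(sized_files, key=lambda pair: pair[0]):
        paths = [path for _s, path in group]
        if len(paths) < 2:
            continue
        # Same size is only a hint; hash the contents to confirm.
        # (Reads whole files into memory; fine for a sketch at this scale.)
        by_hash = {}
        for path in paths:
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
        for same in by_hash.values():
            if len(same) > 1:
                yield same

Each yielded group can then be inspected, keeping the file with the simplest name and removing the rest.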

Iterating through subdirectories to add unique strings to each file

My goal: To build a program that:
Opens a folder (provided by the user) from the user's computer
Iterates through that folder, opening each document in each subdirectory (named according to language codes; "AR," "EN," "ES," etc.)
Substitutes a string in for another string in each document. Crucially, the new string will change with each document (though the old string will not), according to the language code in the folder name.
My level of experience: Minimal; I've been learning Python for a few months, but this is the first program I'm building that's not paint-by-numbers. I'm building it to make a process at work faster. I'm sure I'm not building this as efficiently as possible; I've been throwing it together from my own knowledge and from reading Stack Exchange religiously while building it.
Research I've done on my own: I've been living in stackexchange the past few days, but I haven't found anyone doing quite what I'm doing (which was very surprising to me). I'm not sure if this is just because I lack the vocabulary to search (tried out a lot of search terms, but none of them totally match what I'm doing) or if this is just the wrong way of going about things.
The issue I'm running into:
I'm getting this error:
Traceback (most recent call last):
  File "test5.py", line 52, in <module>
    for f in os.listdir(src_dir):
OSError: [Errno 20] Not a directory: 'ExploringEduTubingEN(1).txt'
I'm not sure how to iterate through every file in the subdirectories and update a string within each file (not the file names) with a new and unique string. I thought I had it, but this error has totally thrown me off. Prior to this, I was getting an error on the same line that said "Not a file or directory: 'ExploringEduTubingEN(1).txt'", and it's surprising to me that the first error could complain about a file or a directory, yet once I fixed that, it asked for just a directory; it seems like it should've just asked for a directory from the beginning.
With no further ado, the code (placing at bottom because it's long to include context):
import os
ex=raw_input("Please provide an example PDF that we'll append a language code to. ")
#Asking for a PDF to which we'll iteratively append the language codes from below.
lst = ['_ar.pdf', '_cs.pdf', '_de.pdf', '_el.pdf', '_en_gb.pdf', '_es.pdf', '_es_419.pdf',
    '_fr.pdf', '_id.pdf', '_it.pdf', '_ja.pdf', '_ko.pdf', '_nl.pdf', '_pl.pdf', '_pt_br.pdf', '_pt_pt.pdf', '_ro.pdf', '_ru.pdf',
    '_sv.pdf', '_th.pdf', '_tr.pdf', '_vi.pdf', '_zh_tw.pdf', '_vn.pdf', '_zh_cn.pdf']
#list of language code PDF appending strings.
pdf_list=open('pdflist.txt','w+')
#creating a document to put this group of PDF filepaths in.
pdf2='pdflist.txt'
#making this an actual variable.
for word in lst:
    pdf_list.write(ex + word + "\n")
    #creating a version of the PDF example for every item in the language list, and then appending the language codes.
pdf_list.seek(0)
langlist=pdf_list.readlines()
#creating a list of the PDF paths so that I can use it below.
for i in langlist:
    i=i.rstrip("\n")
    #removing the line breaks.
pdf_list.close()
#closing the file after removing the line breaks.
file1=raw_input("Please provide the full filepath of the folder you'd like to convert. ")
#the folder provided by the user to iterate through.
folder1=os.listdir(file1)
#creating a list of the files within the folder
pdfpath1="example.pdf"
langfile="example2.pdf"
#setting variables for below
#my thought here is that i'd need to make the variable the initial folder, then make it a list, then iterate through the list.
for ogfile in folder1:
    #want to iterate through all the files in the directory, including in subdirectories
    src_dir=ogfile.split("/",6)
    src_dir="/".join(src_dir[:6])
    #goal here is to cut off the language code folder name and then join it again, w/o language code.
    for f in os.listdir(src_dir):
        f = os.path.join(src_dir, f)
        #i admit this got a little convoluted - i'm trying to make sure the files put the right code in, I.E. that the document from the folder ending in "AR" gets the PDF that will now end in "AR"
        #the perils of pulling from lots of different questions in stackexchange
        with open(ogfile, 'r+') as f:
            content = f.read()
            f.seek(0)
            f.truncate()
            for langfile in langlist:
                f.write(content.replace(pdfpath1, langfile))
                #replacing the placeholder PDF link with the created PDF links from the beginning of the code
If you read this far, thanks. I've tried to provide as much information as possible, especially about my thought process. I'll keep trying things and reading, but I'd love to have more eyes on it.
You have to specify the full path to your directories/files. Use os.path.join to create a valid (and platform-independent) path to your file or directory.
For replacing your string, simply modify your example string using the subfolder name. Assuming that ex has the format filename.pdf, you could use: newstring = ex[:-4] + '_' + str.lower(subfolder) + '.pdf'. That way, you do not have to specify the list of replacement strings nor loop through that list.
Solution
To iterate over your directory and replace the content of your files as you'd like, you can do the following:
import os

# Get the name of the file: "example.pdf" (note the .pdf is assumed here)
ex=raw_input("Please provide an example PDF that we'll append a language code to. ")
# Get the folder to go through
folderpath=raw_input("Please provide the full filepath of the folder you'd like to convert. ")
# Get all subfolders and go through them (named: 'AR', 'DE', etc.)
subfolders=os.listdir(folderpath)
for subfolder in subfolders:
    # Get the full path to the subfolder
    fullsubfolder = os.path.join(folderpath, subfolder)
    # If it is a directory, go through it
    if os.path.isdir(fullsubfolder):
        # Find all files in the subdirectory and go through each of them
        files = os.listdir(fullsubfolder)
        for filename in files:
            # Get the full path to the file
            fullfile = os.path.join(fullsubfolder, filename)
            # If it is a file, process it (note: we do not check if it is a text file here)
            if os.path.isfile(fullfile):
                with open(fullfile, 'r+') as f:
                    content = f.read()
                    f.seek(0)
                    f.truncate()
                    # Create the replacement string based on the subdirectory name. Ex: 'example_ar.pdf'
                    newstring = ex[:-4] + '_' + str.lower(subfolder) + '.pdf'
                    f.write(content.replace(ex, newstring))
Note
Instead of asking the user to type the folder path, you could ask them to select the directory with a dialog box. See this question for more info: Use GUI to open directory in Python 3
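
For instance, a minimal sketch of that dialog-box approach using the standard library's tkinter (Python 3 names shown, whereas the question's code is Python 2):

import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()  # hide the empty root window; we only want the dialog
folderpath = filedialog.askdirectory(title="Select the folder to convert")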

Execute multiple *.dat files from subdirectories (bash, python)

I have the following: a directory with subdirectories which are filled with files. The structure is the following: /periodic_table/{Element}_lj_dat/lj_dat_sim.dat
Each file consists of two rows (the first one is a comment) and 12 columns of data.
What I would like is to go through all the element folders (e.g. Al, Cu, etc.), open a created file (for example named "mergedlj.dat", in the periodic_table directory) and store all the data from each file in it, adding the element name from the parent directory as the first (or last) column of the merged file.
The best way is to ignore the first row in each file and save only the data from the second row.
I am very inexperienced in bash/shell scripting, but I think this is the best way to go (Python is acceptable too!). Unfortunately I have only worked with files in the same folder as the script, so this is a new experience for me.
Here is the code just to find these files, but it doesn't actually do anything I need:
find ../periodic_table/*_lj_dat/ -name lj_dat_sim.dat -print0 | while read -d $'\0' file; do
    echo "Processing $file"
done
Any help will be highly appreciated!!
Here's a Python solution.
You can use glob() to get a list of the matching files and then iterate over them with fileinput.input(). fileinput.filename() lets you get the name of the file that is currently being processed, and this can be used to determine the current element whenever processing begins on a new file, as determined by fileinput.isfirstline().
The current element is added as the first column of the merge file. I've assumed that the field separator in the input files is a single space, but you can change that by altering ' '.join() below.
import re
import fileinput
from glob import glob

dir_prefix = '.'
glob_pattern = '{}/periodic_table/*_lj_dat/lj_dat_sim.dat'.format(dir_prefix)
element_pattern = re.compile(r'.*periodic_table/(.+)_lj_dat/lj_dat_sim.dat')

with open('mergedlj.dat', 'w') as outfile:
    element = ''
    for line in fileinput.input(glob(glob_pattern)):
        if fileinput.isfirstline():
            # extract the element name from the file name
            element = element_pattern.match(fileinput.filename()).groups()[0]
        else:
            print(' '.join([element, line]), end='', file=outfile)
You can use os.path.join() to construct the glob and element regex patterns, but I've omitted that above to avoid cluttering up the answer.
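For completeness, a sketch of that os.path.join() variant; treat it as an assumption-laden illustration, since the regex half still expects '/' separators (POSIX-style paths):

import os
import re

dir_prefix = '.'
# Build the glob pattern portably from path components.
glob_pattern = os.path.join(dir_prefix, 'periodic_table', '*_lj_dat', 'lj_dat_sim.dat')
# The element regex assumes '/' as the separator; on Windows you would
# need to normalise the matched paths first.
element_pattern = re.compile(r'.*periodic_table/(.+)_lj_dat/lj_dat_sim\.dat')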
