How can I search in a subdir for a specific file? - python

Let's say start.py is located in C:\.
import os

path = "C:\\Users\\Downloads\\00005.tex"
file = open(path, "a+")
file.truncate(0)   # discard any existing content
file.write("Hello\n")
file.close()
os.startfile(path)
The subdirectory may contain some files, for example: 00001.tex, 00002.tex, 00003.tex, 00004.tex.
I first want to search the subdirectory for the file with the highest number (00004.tex), create a new one with the next number (00005.tex), write "Hello" to it, and save it in the subdirectory as 00005.tex.
Are the leading zeros necessary, or can I also just name the files 1.tex, 2.tex, 3.tex, and so on?

Textually, "2" is greater than "100" but of course numerically, its the opposite. The reason for writing files as say, "00002.txt" and "00100.text" is that for files numbered up to 99999, the lexical sorting is the same as the numerical sorting. If you write them as "2.txt" and "100.txt" then you need to change the non-extension part of the name to an integer before sorting.
In your case, since you want the next highest number, you need to convert the filenames to integers so that you can get a maximum and add 1. Since you are converting to an integer anyway, your progam doesn't care whether you prepend zeroes or not.
So the choice is based on external reasons. Is there some reason to make it so that a textual sort works? If not, then the choice is purely random and do whatever you think looks better.
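A quick demonstration of the difference, with illustrative filenames:

```python
# Lexical sort compares character by character, so "100.txt" sorts before "2.txt";
# converting the name's stem to an integer restores the numeric order.
names = ["2.txt", "100.txt", "30.txt"]

print(sorted(names))
# ['100.txt', '2.txt', '30.txt']  (lexical)

print(sorted(names, key=lambda n: int(n.split(".", 1)[0])))
# ['2.txt', '30.txt', '100.txt']  (numeric)
```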

You can use glob:
import glob
import os

os.chdir(r"Path")  # replace with the path to your subdirectory
files = glob.glob("*.tex")
entries = sorted(int(entry.split(".", 1)[0]) for entry in files)
next_entry = entries[-1] + 1  # assumes the folder already contains at least one .tex file
next_entry can then be used as the new filename: create a file with that name and write your new content to it.
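Putting the pieces together, a minimal sketch of the whole task (create_next_file is a hypothetical helper name, and it assumes every .tex file in the folder is named <number>.tex):

```python
import glob
import os

def create_next_file(directory, text="Hello\n"):
    """Find the highest-numbered .tex file in directory and create the next one."""
    numbers = [int(os.path.basename(f).split(".", 1)[0])
               for f in glob.glob(os.path.join(directory, "*.tex"))]
    next_number = max(numbers, default=0) + 1        # start at 1 if the folder is empty
    new_path = os.path.join(directory, "%05d.tex" % next_number)  # zero-padded name
    with open(new_path, "w") as f:
        f.write(text)
    return new_path
```

The %05d format keeps the five-digit zero padding; drop the 05 if you decide against padded names.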

Related

Script to change the names of the files in a folder

I have a folder, let's say "Word Assignment 1". It contains 50+ PDF files, each belonging to a different student in my class. They are named xxxxxxxxxxx-name of the student-PHASE 1-MS WORD-ASSIGNMENT 1, where xxxxx represents the student's register number and the name part differs for each file. I have an Excel file that maps register numbers to the corresponding student names. The names the students supplied when submitting the PDFs differ from the required format; I want the filenames as specified above.
I need a script, in either Python or Bash, that renames the files by looking up the register number (which is at the start of every filename) in the Excel sheet, fetching the name, and renaming the file according to the format.
I tried Bash, but I have no idea how to search through the Excel file or the filenames.
In the following solution, I've made certain assumptions that you may not satisfy:
- I've supposed the students' IDs consist only of numeric characters. If that is not the case, change df["id"] == int(student_id) to df["id"] == student_id.
- I've assumed the column holding the students' IDs is named id; if not, change df["id"] to df["your_column_name"].
- Similarly for the students' names column: if it is not named name, change df.iloc[id_]["name"] to df.iloc[id_]["your_column_name"].
- The folder named Word Assignment 1 is assumed to sit next to the script. If that is not the case, change the path variable to the absolute (or relative) path of that folder.
Solution:
import os
from typing import List

import pandas as pd

filename: str = "your_file.xlsx"
path: str = "./Word Assignment 1"

df: pd.DataFrame = pd.read_excel(filename, sheet_name=0)
files: List[str] = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]

for file in files:
    student_id: str = file.split("-")[0]
    id_: int = df.index[df["id"] == int(student_id)].tolist()[0]
    name: str = df.iloc[id_]["name"]
    os.rename(os.path.join(path, file),
              os.path.join(path, f"{student_id}-{name}-PHASE 1-MS WORD-ASSIGNMENT 1.pdf"))
Keep it simple. Save the sheet as a CSV - I assume you can create a tab-delimited file with just ID/Name. If I call it ID.csv:
while read id name; do mv "$id"* "$id-$name-PHASE 1-MS WORD-ASSIGNMENT 1.pdf"; done < ID.csv
This assumes the ID is a known, fixed length, and that no file has a following character that could misidentify it and/or match multiple entries.
If the name has embedded whitespace, it's best to avoid including subsequent fields. If that's not an option, make sure your delimiter is distinct, such as a pipe character, which doesn't usually show up in names, and set it in front of the read:
while IFS="|" read id name _   # everything after the second field goes into _
If the fields contain quotes, it may take some more tweaking...
Just as commentary, I recommend you avoid embedding spaces and such in filenames, purely as a habit that makes tasks like this easier.

Most efficient way to check if a string contains any file format?

I have a .txt with hundreds of thousands of paths, and I simply have to check whether each line is a folder or a file. The hard drive is not with me, so I can't use the os module's os.path.isdir() function. I've tried the code below, but it is just not reliable, since some folder names contain a . near the end.
for row in files:
    if row[-6:].find(".") < 0:
        folders_count += 1
It is just not worth testing whether the end of the string matches any known file format (.zip, .pdf, .doc ...), since there are dozens of different file formats on this HD. When my code reads the .txt, it stores each line as a string in a list, so my code should work on the string form of each path.
An example of a folder path:
'path1/path2/truckMV.34'
An example of a file path:
'path1/path2/certificates.pdf'
It's impossible for us to judge whether it's a file or a folder just from the string, since an extension is just an arbitrary, agreed-upon string that programs choose to decode in a certain way.
Having said that, if I had the same problem I would do my best to estimate it with the following pseudocode:
- Create a hash map (or a dictionary, as you are in Python).
- For every line of the file, read the last path component and see if it contains a ".".
- Create a key for each "possible extension" in the hash map, with a counter of how many times you have encountered it.
- After you go through the whole list, you will have a collection of possible extensions and their occurrence counts. Assume the ones with only 1 occurrence (or any other low, arbitrary number) belong to a folder name rather than an extension.
The basis of this heuristic is that it's unlikely for a person to have many unique extensions on their drive - but that's just an assumption I came up with.
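The counting heuristic above can be sketched in a few lines (guess_folders and its threshold parameter are hypothetical names, and the paths are assumed to use forward slashes as in the examples):

```python
from collections import Counter

def guess_folders(paths, threshold=1):
    """Heuristic: treat a trailing '.xyz' as a real extension only if it
    occurs more than `threshold` times across the whole listing."""
    def suffix(path):
        last = path.rsplit("/", 1)[-1]          # last path component
        return last.rsplit(".", 1)[-1] if "." in last else None

    # count how often each candidate extension appears
    counts = Counter(s for p in paths if (s := suffix(p)) is not None)
    # rare "extensions" are assumed to belong to folder names
    return [p for p in paths
            if suffix(p) is None or counts[suffix(p)] <= threshold]
```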

Find and remove duplicate files using Python

I have several folders which contain duplicate files with slightly different names (e.g. file_abc.jpg and file_abc(1).jpg). I am trying to develop a relatively simple method to search through a folder, identify the duplicates, and then delete them. The criterion for a duplicate is a "(1)" at the end of the filename, so long as the original also exists.
I can identify the duplicates okay; however, I am having trouble building the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", but using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".
I have looked at answers such as Finding duplicate files and removing them; however, those seem far more sophisticated than what I need.
If there are better (and simpler) ways to do this, then let me know; however, I only have around 10,000 files in total across 50-odd folders, so not a great deal of data to crunch through.
My code so far is:
import os
file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print (file_list)
for file in file_list:
    if "(1)" in file:
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: " + str(index_no))
        file_remove = ('r"%s' % file_path + "'\'" + file + '"')
        print("The text string is: " + file_remove)
        os.remove(file_remove)
Your code is just a little more complex than necessary, and you didn't use a proper way to build a file path from a directory and a file name. Also, I think you should not remove files which have no original (i.e. files which aren't duplicates even though their name makes them look like one).
Try this:
for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))
Mind though, that this doesn't work properly for files which have multiple occurrences of (1) in them, and files with (2) or higher numbers also aren't handled at all. So my real proposition would be this:
Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
Of course you should check the contents of these few files then to be sure that not just two of them are accidentally the same size without being identical. If you are sure you have a group of identical ones, remove all but the one with the simplest names (e. g. without suffixes (1) etc.).
By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).
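That size-then-content proposal can be sketched like this (find_duplicate_groups is a hypothetical name; it groups by size in a dictionary instead of sorting, which finds the same neighbours, and it hashes whole files, which is fine for ~10,000 files):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_groups(root_dir_path):
    """Group files under root_dir_path by size, then confirm real
    duplicates by hashing the contents of same-sized files."""
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root_dir_path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                      # a unique size cannot be a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group is a list of identical files; you would then keep the one with the simplest name and remove the rest.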

How do find the average of the numbers in multiple text files?

I have multiple (around 50) text files in a folder, and I wish to find the mean average across all of them. Is there a way for Python to add up all the numbers in each of these files automatically and find their average?
I assume you don't want to put the name of all the files manually, so the first step is to get the name of the files in python so that you can use them in the next step.
import os
import numpy as np

Initial_directory = "<the full path to the folder containing your 50 files>"
Files = []
for file in os.listdir(Initial_directory):
    Files.append(os.path.join(Initial_directory, file))
Now the list called "Files" holds the paths of all 50 files. Let's make another list to store the average of each file.
Reading the data from each file depends on how the data is stored, but I assume each line holds a single value.
Averages = []
for i in range(len(Files)):
    Data = np.loadtxt(Files[i])
    Averages.append(np.average(Data))
Looping over all the files, Data stores the values in each file, and their average is then appended to the list Averages.
This can be done if we unpack the steps needed to accomplish it.
Steps:
- Python has a module called os that lets you interact with the file system. You'll need it to access the files and read from them.
- Declare a few counter variables to be used for the duration of the script, plus the name of the directory where the files reside.
- Loop over the files in the directory, incrementing the file_count variable by 1 each time (to get the total number of files, used for averaging at the end of the script).
- Join each file's name with the directory to build a path the open function can use to find the right file.
- Read each file and add each line (assuming it's a number) to the running total (used for averaging at the end of the script), stripping the newline character.
- Finally, print the average, or continue using it in the script for whatever you need.
You could try something like the following:
#!/usr/bin/env python
import os

file_count = 0
total = 0
dir_name = 'your_directory_path_here'

for file_name in os.listdir(dir_name):
    file_count += 1
    file_path = os.path.join(dir_name, file_name)
    with open(file_path, 'r') as file:
        for line in file.readlines():
            total += int(line.strip('\n'))

avg = total / file_count
print(avg)

Python automated file names

I want to automate the file name used when saving a spreadsheet using xlwt. Say there is a sub directory named Data in the folder the python program is running. I want the program to count the number of files in that folder (# = n). Then the filename must end in (n+1). If there are 0 files in the folder, the filename must be Trial_1.xls. This file must be saved in that sub directory.
I know the following:
import xlwt, os, os.path
n = len([name for name in os.listdir('.') if os.path.isfile(name)])
counts the number of files in the same folder.
a = n + 1
filename = "Trial_" + str(a) + ".xls"
book.save(filename)
this will save the properly named file into the same folder.
My question is how do I extend this in to a sub directory? Thanks.
In os.listdir('.'), the . points to the directory from which the script is executed. Change the . to point to the subdirectory you are interested in.
You should give it the full path name from the root of your file system; otherwise it will be relative to the directory from where the script is executed. This might not be what you want; especially if you need to refer to the sub directory from another program.
You also need to provide the full path to the filename variable; which would include the sub directory.
To make life easier, just set the full path to a variable and refer to it when needed.
TARGET_DIR = '/home/me/projects/data/'
n = sum(1 for f in os.listdir(TARGET_DIR) if os.path.isfile(os.path.join(TARGET_DIR, f)))
new_name = "{}Trial_{}.xls".format(TARGET_DIR, n + 1)
You actually want glob:
from glob import glob
DIR = 'some/where/'
existing_files = glob(DIR + '*.xls')
filename = DIR + 'stuff--%d--stuff.xls' % (len(existing_files) + 1)
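Combining the counting with the subdirectory handling into one helper (next_trial_path is a hypothetical name; the Data folder is assumed to sit next to the script, and book is an xlwt Workbook you have already built):

```python
import os

def next_trial_path(sub_dir="Data"):
    """Count the files already in sub_dir and build the next Trial_<n>.xls path."""
    os.makedirs(sub_dir, exist_ok=True)   # create the subdirectory if it is missing
    n = sum(1 for name in os.listdir(sub_dir)
            if os.path.isfile(os.path.join(sub_dir, name)))
    return os.path.join(sub_dir, "Trial_%d.xls" % (n + 1))

# with an xlwt Workbook already built:
# book.save(next_trial_path())            # saves e.g. Data/Trial_1.xls
```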
Since you said Burhan Khalid's answer "Works perfectly!" you should accept it.
I just wanted to point out a different way to compute the count. The way you are doing it works, but if we imagine you were counting grains of sand or something, it would use way too much memory. Here is a more direct way to get the count:
n = sum(1 for name in os.listdir('.') if os.path.isfile(name))
For every qualifying name, we get a 1, and all these 1's get fed into sum() and you get your count.
Note that this code uses a "generator expression" instead of a list comprehension. Instead of building a list, taking its length, and then discarding the list, the above code just makes an iterator that sum() iterates to compute the count.
It's a bit sleazy, but there is a shortcut we can use: sum() will accept boolean values, and will treat True as a 1, and False as a 0. We can sum these.
# sum will treat Boolean True as a 1, False as a 0
n = sum(os.path.isfile(name) for name in os.listdir('.'))
This is sufficiently tricky that I probably would not use this without putting a comment. But I believe this is the fastest, most efficient way to count things in Python.
