I have a folder, let's say "Word Assignment 1". It contains 50+ PDF files, each belonging to a different student in my class. The files should be named xxxxxxxxxxx-name of the student-PHASE 1-MS WORD-ASSIGNMENT 1, where xxxxxxxxxxx is the student's register number and the name changes for each file. I have an Excel file that lists the register numbers and the corresponding student names. The names the students gave their PDFs on submission differ from the required format. I want the filenames in the format described above.
I need a script in either Python or Bash that renames the files by comparing the register number (which comes first in every filename) with the Excel sheet, fetching the name, and renaming the file according to the format.
I tried to use Bash, but I have no idea how to search through the Excel file or the filenames.
In the following solution, I've made some assumptions that may not hold in your case.
I've assumed the students' IDs contain only numeric characters. If that is not the case, please change df["id"] == int(student_id) to df["id"] == student_id.
I've assumed the column where you store the students' IDs is named id. If that is not the case, please change df["id"] to df["your_column_name"].
Similarly, if the students' names column is not named name, please change df.iloc[id_]["name"] to df.iloc[id_]["your_column_name"].
I've assumed the folder named Word Assignment 1 is located in the same folder as the script. If that is not the case, please change the path variable to the absolute (or relative) path to that folder.
Solution:
import os
import pandas as pd
from typing import List

filename: str = "your_file.xlsx"
path: str = "./Word Assignment 1"

# Read the first sheet of the workbook (read_excel already returns a DataFrame)
df: pd.DataFrame = pd.read_excel(filename, sheet_name=0)
# Keep regular files only (skip sub-folders)
files: List[str] = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]

for file in files:
    # The register number is everything before the first hyphen
    student_id: str = file.split("-")[0]
    # Index of the row whose id matches (raises IndexError if there is no match)
    id_: int = df.index[df["id"] == int(student_id)].tolist()[0]
    name: str = df.iloc[id_]["name"]
    os.rename(os.path.join(path, file), os.path.join(path, f"{student_id}-{name}-PHASE 1-MS WORD-ASSIGNMENT 1.pdf"))
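As a variation, here is a minimal sketch that separates planning the renames from performing them, using a plain dictionary for the lookup. The helper names `new_filename` and `plan_renames` and the sample data are hypothetical. In practice the mapping could be built from the sheet, for example with `dict(zip(df["id"].astype(str), df["name"]))`, assuming the column names used above.

```python
from typing import Dict, List, Tuple

# Fixed tail of the required filename format
SUFFIX = "PHASE 1-MS WORD-ASSIGNMENT 1.pdf"


def new_filename(student_id: str, name: str) -> str:
    """Build the required filename for one student."""
    return f"{student_id}-{name}-{SUFFIX}"


def plan_renames(files: List[str], id_to_name: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (old, new) pairs; files with an unknown register number are skipped."""
    plan = []
    for old in files:
        # The register number is everything before the first hyphen
        student_id = old.split("-")[0]
        name = id_to_name.get(student_id)
        if name is None:
            continue  # not in the sheet: leave the file untouched
        plan.append((old, new_filename(student_id, name)))
    return plan


# Hypothetical sample data, for illustration only
id_to_name = {"12345678901": "Jane Doe"}
files = ["12345678901-jane-word1.pdf", "99999999999-unknown.pdf"]
plan = plan_renames(files, id_to_name)
```

Each pair in `plan` can then be applied with `os.rename(os.path.join(path, old), os.path.join(path, new))`. Printing the plan first makes it easy to check the result before any file is touched.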
Keep it simple. Save the sheet as a delimited text file. I assume you can create a tab-delimited file with just ID/Name. If I call it ID.csv:
while read -r id name; do mv "$id"* "$id-$name-PHASE 1-MS WORD-ASSIGNMENT 1.pdf"; done < ID.csv
This assumes ID is a known and fixed length and none of the files have a following character that can misidentify and/or return multiple entries.
If the name has embedded whitespace, best to avoid including subsequent fields. If that's not an option, make sure your delimiter is distinct, such as a pipe character which doesn't usually show up in names, and set it in front of the read.
while IFS="|" read -r id name _ # everything after the second pipe goes into _
If you have quotes in the fields it may take some more tweaking...
Just as commentary, I recommend you avoid embedding spaces and such in filenames, but only as a habit to make things like this easier.
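For readers who would rather stay in Python, the same pipe-delimited file can be read with the standard `csv` module, which also copes with quoted fields. This is only a sketch: the function name `read_id_names` and the sample rows are hypothetical.

```python
import csv
import io


def read_id_names(lines):
    """Return {id: name} from pipe-delimited rows; fields after the second are ignored."""
    mapping = {}
    for row in csv.reader(lines, delimiter="|"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        student_id, name = row[0].strip(), row[1].strip()
        mapping[student_id] = name
    return mapping


# Hypothetical sample content standing in for the exported sheet
sample = io.StringIO("12345678901|Jane Doe|extra\n12345678902|John Q. Public\n")
mapping = read_id_names(sample)
```

With a real file, pass an open file object instead of the `StringIO`, for example `open("ID.csv", newline="")`.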
Related
I am attempting to create an executable that renames all files in its folder.
I am lost on how to reference (and add to) the beginning of a file name.
For example:
file_text.xlsx
What I need it to look like:
10-30-2021_file_text.xlsx
I would like to prepend my own string(s) to the file name.
NOTE: 'file_text' are randomly generated file names, and all unique. So I would need to keep the unique names and just add to the beginning of each unique file name.
Here is what I have so far to rename the other files, I figured I could reference a space but this did not work as there are no spaces.
import os
directory = os.getcwd()
txt_replaced = ' '
txt_needed = '2021-10-31'
for f in os.listdir(directory):
    os.rename(os.path.join(directory, f),
              os.path.join(directory, f.replace(txt_needed, txt_replaced)))
I also am curious if there is a way to reference specific positions within the file name.
For example:
text_text1.csv
If there is a way to uppercase only 'text1'?
Thank you!
replace() doesn't work because there's no space character at the beginning of the filename. So there's nothing to replace.
You also didn't add the _ character after the date.
Just use ordinary string concatenation or formatting:
for f in os.listdir(directory):
    os.rename(os.path.join(directory, f),
              os.path.join(directory, f"{txt_needed}_{f}"))
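The question also asked about changing only one part of a name, such as uppercasing `text1` in `text_text1.csv`. One possible sketch, assuming the part to change is whatever follows the last underscore; the helper name `uppercase_last_part` is hypothetical:

```python
import os


def uppercase_last_part(filename: str, sep: str = "_") -> str:
    """Uppercase the text after the last separator, leaving the extension alone."""
    stem, ext = os.path.splitext(filename)
    head, found, tail = stem.rpartition(sep)
    if not found:
        return filename  # no separator: nothing to change
    return f"{head}{sep}{tail.upper()}{ext}"


result = uppercase_last_part("text_text1.csv")
```

Strings can also be sliced by position (for example `f[:4]` or `f[5:]`), but splitting on a separator tends to hold up better when the parts vary in length.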
I have data that has been collected and organized in multiple folders.
In each folder, there can be multiple similar runs -- e.g. collected data under the same conditions, at different times. These filenames contain a number in them that increments. Each folder contains similar data collected under different conditions. For example, I can have an idle folder, and in it can be files named idle_1.csv, idle_2.csv, idle_3.csv, etc. Then I can have another folder pos1 folder, and similarly, pos1_1.csv, pos1_2.csv, etc.
In order to keep track of what folder and what file the data in the arrays came from, I want to use the folder name, "idle", "pos1", etc, as the array name. Then, each file within that folder (or the data resulting from processing each file in that folder, rather) becomes another row in that array.
For example, if the name of the folder is stored in variable arrname, and the file index is stored in variable arrndx, I want to write the value into that array:
arrname[arrndx]=value
This doesn't work, giving the following error:
TypeError: 'str' object does not support item assignment
Then, I thought about using a dictionary to do this, but I think I still would run into the same issue. If I use a dictionary, I think I need each dictionary's name to be the name derived from the folder name -- creating the same issue. If I instead try to use it as a key in a dictionary, the entries get overwritten with data from every file from the same folder since the name is the same:
arrays['name']=arrname
arrays['index']=int(arrndx)
arrays['val']=value
arrays['name': arrname, 'index':arrndx, 'val':value]
I can't use 'index' either since it is not unique across each different folder.
So, I'm stumped. I guess I could predefine all the arrays, and then write to the correct one based on the variable name, but that could result in a large case statement (is there such a thing in python?) or a big if statement. Maybe there is no avoiding this in my case, but I'm thinking there has to be a more elegant way...
EDIT
I was able to work around my issue using globals():
globals()[arrname].insert(int(arrndx),value)
However, I believe this is not the "correct" solution, although I don't understand why it is frowned upon to do this.
Use a nested dictionary with the folder names at the first level and the file indices (or names) at the second.
from pathlib import Path

data = {}
base_dir = 'base'
for folder in Path(base_dir).resolve().glob('*'):
    if not folder.is_dir():
        continue
    data[folder.name] = {}
    for csv in folder.glob('*.csv'):
        file_id = csv.stem.split('_')[1]
        data[folder.name][file_id] = csv
The above example just saves the file name in the structure but you could alternatively load the file's data (e.g. using Pandas) and save that to the dictionary. It all depends what you want to do with it afterwards.
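To tie this back to the original `arrname[arrndx]=value` attempt, here is a minimal sketch of writing a value into such a nested dictionary when the folder name and file index are held in variables. The helper name `store` and the sample values are hypothetical.

```python
def store(data, arrname, arrndx, value):
    """Write value under data[arrname][arrndx], creating the inner dict on first use."""
    data.setdefault(arrname, {})[int(arrndx)] = value


data = {}
# Hypothetical processed results, one per file
store(data, "idle", "1", 0.5)
store(data, "idle", "2", 0.7)
store(data, "pos1", "1", 1.2)
```

Because the folder name is only a key, no variable has to be created from a string, so `globals()` is not needed. A value is read back with `data[arrname][int(arrndx)]`.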
What about:
foldername = 'idle'  # say your folder name is idle, for example
number_of_files_inside_folder = 3  # example value: set this to the number of files in the folder
files = {}
files[foldername] = [foldername + "_" + str(i) + ".csv" for i in range(1, number_of_files_inside_folder + 1)]
Does that solve your problem?
Let's say the start.py is located in C:\.
import os
path = "C:\\Users\\Downloads\\00005.tex"
file = open(path,"a+")
file.truncate(0)
file.write("""Hello
""")
file.close()
os.startfile("C:\\Users\\Downloads\\00005.tex")
In the subdirectory could be some files. For example: 00001.tex, 00002.tex, 00003.tex, 00004.tex.
I want first to search in the subdir for the file with the highest number (00004.tex) and create a new one with the next number (00005.tex), write "Hello" and save it in the subdir 00005.tex.
Are the zeros necessary, or can I also just name them 1.tex, 2.tex, 3.tex, ...?
Textually, "2" is greater than "100", but of course numerically it's the opposite. The reason for writing files as, say, "00002.txt" and "00100.txt" is that for files numbered up to 99999, the lexical sorting is the same as the numerical sorting. If you write them as "2.txt" and "100.txt", then you need to convert the non-extension part of the name to an integer before sorting.
In your case, since you want the next highest number, you need to convert the filenames to integers so that you can get a maximum and add 1. Since you are converting to an integer anyway, your program doesn't care whether you prepend zeros or not.
So the choice is based on external reasons. Is there some reason to make it so that a textual sort works? If not, then the choice is arbitrary, so do whatever you think looks better.
You can use glob:
import glob, os
os.chdir(r"Path")
files = glob.glob("*.tex")
entries = sorted([int(entry.split(".", 1)[0]) for entry in files])
next_entry = entries[-1]+1
next_entry is the next number as an integer. Convert it to a string (padding it with zeros if you want) to build the new filename. You can then create a new file with this name and write your new content to it.
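A minimal sketch of turning that number back into a zero-padded name, assuming five-digit names as in the question. The helper name `next_tex_name` is hypothetical; names that are not purely numeric are skipped, and an empty folder starts at 1.

```python
import re


def next_tex_name(names, width=5):
    """Return the next zero-padded .tex filename after the highest existing number."""
    numbers = []
    for name in names:
        match = re.fullmatch(r"(\d+)\.tex", name)
        if match:
            numbers.append(int(match.group(1)))
    # default=0 covers the case where no numbered file exists yet
    next_number = max(numbers, default=0) + 1
    return f"{next_number:0{width}d}.tex"


new_name = next_tex_name(["00001.tex", "00002.tex", "00004.tex", "notes.tex"])
```

The list of names could come from `os.listdir(subdir)` or `glob.glob("*.tex")`. Passing `width=1` gives unpadded names such as `5.tex`.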
I'm currently developing a loop where for each csv in the directory I am re-sampling and then saving as a new csv. However, I would like to retain only part of the original path string contained within the variable so that I can add an identifier for the new file.
For example, the file picked up through the loop may be:
'...\folder1\101_1000_RoomTemperatures.csv'
But I would like the new saved file to look like:
'...\folder2\101_1000_RoomTemperatures_Rounded.csv'
I have noticed SQL- and C-related posts about this issue, but I suspect their solutions are not relevant in Python. Using the code below I can rename the outputs so they can be told apart, but the result is not ideal!
for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        # Pull in the file
        df = pd.read_csv(filename)
        # actions occur here
        # Export the file
        df.to_csv('{}_rounded.csv'.format(str(filename)))
The output using this code is:
'...\folder1\101_1000_RoomTemperatures.csv_rounded.csv'
A simple solution would be to split by dots and omit the part after the last dot to get the base filename (without file extension):
>>> filename = r'...\folder1\101_1000_RoomTemperatures.csv'
>>> filename
'...\\folder1\\101_1000_RoomTemperatures.csv'
>>> base_filename = '.'.join(filename.split('.')[:-1])
>>> base_filename
'...\\folder1\\101_1000_RoomTemperatures'
Then use this base_filename to give the new name:
df.to_csv('{}_rounded.csv'.format(base_filename))
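An alternative sketch uses `os.path.splitext`, which splits off only the final extension and so avoids the manual split and join. The helper name `rounded_name` and the default output folder `folder2` are assumptions taken from the example paths in the question.

```python
import os


def rounded_name(filename, out_dir="folder2", suffix="_Rounded"):
    """Return out_dir/<base><suffix><ext> for the given input file."""
    # basename drops the source folder; splitext separates the final extension
    base, ext = os.path.splitext(os.path.basename(filename))
    return os.path.join(out_dir, f"{base}{suffix}{ext}")


source = os.path.join("folder1", "101_1000_RoomTemperatures.csv")
target = rounded_name(source)
```

The result can be passed straight to `df.to_csv(target)`, provided the output folder already exists.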
I am trying to use pandas.read_csv to read files that contain dates in their names. I used the code below to do the job. The problem is that the file names are not consistent, because the dates in them do not follow a fixed pattern. Is there a way to have the code read the files by matching only part of the name, given that the date comes near the front of the file name?
for x in range(0,10):
    dat = 20170401+x
    dat2 = dat+15
    file_name='JS_ALL_V.'+str(dat)+'_'+str(dat2)+'.csvp.gzip'
    df = pd.read_csv(file_name,compression='gzip',delimiter='|')
You can use the glob library to match file names with Unix shell-style patterns.
Here is a minimal example:
import glob
for name in glob.glob('dir/*'):
    print(name)
An alternative to glob.glob() (since it does not seem to be working for you) is os.listdir(), as explained in this question, which gives you a list of all the entries (or just the files) in your path.
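Either approach can be combined with a wildcard pattern so that the dates never have to be computed. The sketch below filters a list of names with `fnmatch`, which uses the same pattern syntax as `glob`. The pattern is an assumption based on the file names in the question, and the sample names and the helper `matching_files` are hypothetical.

```python
import fnmatch

# Assumed pattern: fixed prefix, two date fields, fixed extension
PATTERN = "JS_ALL_V.*_*.csvp.gzip"


def matching_files(names, pattern=PATTERN):
    """Return the names that match the pattern, sorted."""
    return sorted(fnmatch.filter(names, pattern))


# Hypothetical directory listing, e.g. the result of os.listdir(directory)
names = [
    "JS_ALL_V.20170402_20170418.csvp.gzip",
    "notes.txt",
    "JS_ALL_V.20170401_20170416.csvp.gzip",
]
selected = matching_files(names)
```

Each name in `selected` can then be read as in the question, with `pd.read_csv(name, compression='gzip', delimiter='|')`.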