Having trouble getting function to execute after pandas preprocessing - python

I am doing some work with pandas that requires some preprocessing so that I can graph as intended. Right now I am looping through the column names and deleting the ones I do not need. After that, I merge the result with another pandas DataFrame so that I can make the next function call. The code looks something like:
def makePlotFile(df, asg, dueDate, path, outputFile, gradesPath=None):
    print(gradesPath)
    vizData(df, asg, dueDate, path)
    vizAttempts(df, asg, dueDate, path)
    vizFirstAttempt(df, asg, dueDate, path)
    graph1 = path + "/output1.pdf"
    graph2 = path + "/output2.pdf"
    graph3 = path + "/output3.pdf"
    ready = False
    if gradesPath != None:
        print("Will include grade information")
        grades = pd.read_csv(gradesPath, error_bad_lines=False)
        for column_name, _ in grades.iteritems():
            if asg not in column_name:
                if column_name != 'Email':
                    del grades[column_name]
            if asg in column_name:
                grade = column_name
                ready = True
                print('GOT COLUMN NAME')
        #await asyncio.wait(grade)
        if ready:
            print('HIT IT')
            pd.merge(df, grades, on='Email')
            vizGradesFirstAttempt(df, asg, dueDate, path,grade)
            graph4 = path + "/output4.pdf"
            pdfs = [graph1, graph2, graph3, graph4]
            merger = PdfFileMerger()
            for pdf in pdfs:
                merger.append(pdf)
            merger.write(outputFile)
            merger.close()
    if gradesPath == None:
        print("No grade information given")
        pdfs = [graph1, graph2, graph3]
        merger = PdfFileMerger()
        for pdf in pdfs:
            merger.append(pdf)
        merger.write(outputFile)
        merger.close()
I have added some print statements to help me debug. "Will include grade information" prints, but execution never reaches the other print statements. I am not sure whether it is a synchronization issue or something else. I am assuming I am missing something small, but I am not sure what, and I would appreciate some guidance.
The function calls to:
vizAttempts(df, asg, dueDate, path)
vizFirstAttempt(df, asg, dueDate, path)
all work as expected. They are functions that create graphs from my dataframe. I then merge the graphs into a single PDF, which is what graph1, graph2, and graph3 are for.
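No answer to this question appears here, so what follows is only a sketch under stated assumptions. Two things in the code look risky: columns are deleted from grades while the loop is still iterating over it, and the result of pd.merge is never assigned, so df is unchanged when it reaches vizGradesFirstAttempt. In addition, DataFrame.iteritems and the error_bad_lines argument were removed in pandas 2.0, so the loop may raise on a recent installation. The helper name select_grade_columns and the sample data are hypothetical.

```python
import pandas as pd

def select_grade_columns(grades, asg):
    """Return (trimmed copy, grade column name) without mutating `grades`.

    Hypothetical helper, not part of the original code.
    """
    matches = [name for name in grades.columns if asg in name]
    if not matches:
        return None, None
    # Keep 'Email' plus every column whose name contains the assignment id.
    keep = [name for name in grades.columns if name == "Email" or asg in name]
    # The original loop ends up with the last matching column as `grade`.
    return grades[keep].copy(), matches[-1]

# Made-up data standing in for the real frames.
df = pd.DataFrame({"Email": ["a@example.org", "b@example.org"], "attempts": [3, 5]})
grades = pd.DataFrame({
    "Email": ["a@example.org", "b@example.org"],
    "hw1 (points)": [90, 80],
    "hw2 (points)": [70, 60],
})

trimmed, grade = select_grade_columns(grades, "hw1")
if grade is not None:
    # pd.merge returns a new frame; the result has to be assigned to be used.
    merged = pd.merge(df, trimmed, on="Email")
    print(merged.columns.tolist())
```

If this matches the real situation, merged (rather than df) would be the frame to pass on to vizGradesFirstAttempt.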

Related

How to insert a data frame as an object attribute

This is most likely a pretty basic question, but I am still learning about classes/objects/constructors/etc. and I am trying to apply some of these concepts to my current workflow.
I am trying to create a class that automatically saves my data frame as a CSV or xlsx file, depending on what I specify, to a given folder. However, I don't believe that I am correctly passing my data frame as an object attribute. This is my code as it stands:
award_date_change = merged_df.loc[merged_df['award_date_change'] == 'yes'] #this is my data frame
class uploading_to_GC:
    def __init__(self, file_name, file_type, already_exists): #constructor where I want to pass my data frame, file type to be saved to, and specifying if the file already exists in my folder
        self.file_name = file_name
        self.file_type = file_type
        self.already_exists = already_exists
    def print_file_name(self):
        self.file_name.head(5)
    def private_workspace(self):
        commonPath = os.path.expanduser(r"~\path")
        GCdocs = commonPath + '384593683' + '\\'
        path = GCdocs + "" + file_name
        if len(self.file_name) != 0 and self.already_exists == True: #if a file already exists in Gfolder
            if self.file_type == "csv": #for csv files
                GC_old = pd.read_csv(path)
                GC_new = GC_old.append(self.file_name, ignore_index=True)
                GC_new.to_csv(path, index = False)
                print("csv file is updated to private workspace in GCdocs")
            elif self.file_type == "xlsx": #for xlsx files
                GC_old = pd.read_csv(path)
                GC_new = GC_old.append(self.file_name, ignore_index=True)
                GC_new.to_excel(path, index = False)
                print("excel file is updated to private workspace in GCdocs")
            else:
                print("unrecognized file type")
        elif len(self.file_name) != 0 and self.already_exists == False: #if a file does not already exist in folder
            if self.file_type == "csv":
                self.file_name.to_csv(path,index=False)
            if self.file_type == "xlsx":
                self.file_name.to_excel(path,index=False)
            else:
                print("unrecognized file type")
        else:
            print("there is no data to upload")
award_date_change = uploading_to_GC(award_date_change,"csv", False)
award_date_change.private_workspace
I am aware that I don't need to use a class to do this, but I wanted to challenge myself to start using classes more often. Any help would be appreciated
You can pass and store a df in a class as a data member very simply:
import pandas as pd
class Foo:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        # or, if you want to be sure you don't modify the original df
        self.df = df.copy()
df = pd.DataFrame()
foo_obj = Foo(df)
Edit: the : pd.DataFrame is a type hint. It does not affect how the code runs; it only tells the reader that a pd.DataFrame is expected as input. Good IDEs and type checkers will also warn you if you pass something else.
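To illustrate the point about type hints, here is a minimal sketch. It uses a plain list in place of a DataFrame so that it runs without pandas, and the class names Holder and CheckedHolder are hypothetical. It shows that an annotation alone does not reject a wrong argument at runtime, while an explicit isinstance check does.

```python
class Holder:
    def __init__(self, data: list):
        # The annotation documents intent; Python does not check it at runtime.
        self.data = data

class CheckedHolder:
    def __init__(self, data: list):
        # An explicit isinstance check is needed if a wrong type should fail early.
        if not isinstance(data, list):
            raise TypeError(f"expected list, got {type(data).__name__}")
        self.data = data

unchecked = Holder("not a list")  # accepted despite the hint

try:
    CheckedHolder("not a list")
    rejected = False
except TypeError:
    rejected = True
```

The same pattern would apply with isinstance(df, pd.DataFrame) in the constructor of a class that stores a data frame.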

How can I loop a pathway through a function that is only taking raw strings?

I am currently writing a script that generates a report (output is .csv) on directory contents. Each report is unique in that it saves with unique date/timestamp, so the report doesn't save over itself or append to the same file each time.
The column headers in the report are as follows;
header = ['File_Pathway', 'Subdir', 'File_Name', 'Extension', 'Size_(in_bytes)', 'File_Created', 'Last_File_Save_Date', 'File_Author', 'Last_Saved_By_User_X']
I am struggling to get the File_Author and Last_Saved_By_User_X, but found a script here that collects this information using file metadata:
import win32com.client
sh=win32com.client.gencache.EnsureDispatch('Shell.Application',0)
ns = sh.NameSpace(r'm:\music\Aerosmith\Classics Live!')
colnum = 0
columns = []
while True:
    colname=ns.GetDetailsOf(None, colnum)
    if not colname:
        break
    columns.append(colname)
    colnum += 1
for item in ns.Items():
    print (item.Path)
    for colnum in range(len(columns)):
        colval=ns.GetDetailsOf(item, colnum)
        if colval:
            print('\t', columns[colnum], colval)
The issue I run into is with ns = sh.NameSpace(r'm:\music\Aerosmith\Classics Live!'), as it seems to accept only raw strings. The path I want to pass to sh.NameSpace is a variable, current_filepath, which changes as the script loops through the files in the directory.
I have tried every method from this article to convert the string variable into a raw string to pass through this function but nothing is working. Can anyone help shed some light on this for me?
For more context, here is some more sample code from the script I am writing to show you what the current_filepath variable is:
rootdir = input('Enter directory pathway: ')
count = 0
datetime_for_filename = datetime.now()
datetime_for_filename_format = str(datetime.strftime(datetime_for_filename, '%Y-%m-%d--%H-%M-%S'))
filename_with_datetimestamp = 'filename_printout' + '-' + datetime_for_filename_format + '.csv'
header = ['File_Pathway', 'Subdir', 'File_Name', 'Extension', 'Size_(in_bytes)', 'File_Created', 'Last_File_Save_Date', 'File_Author', 'Last_Saved_By_User_X']
for subdir, dirs, files in os.walk(rootdir):
    with open(filename_with_datetimestamp, 'a', newline='') as f:
        writer = csv.writer(f)
        current_subdir = subdir
        try:
            for filenames in files:
                data_list = []
                current_filepath = subdir + '\\''' + filenames
                raw_current_filepath = fr"{current_filepath}"
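No answer to this question appears here, so the following is a hedged sketch rather than a confirmed fix. A raw string is only a way of writing a literal in source code; once the program runs it is an ordinary str, so a path held in a variable needs no conversion. The sketch uses the standard ntpath module (Windows path rules, importable on any platform) and hypothetical folder and file names.

```python
import ntpath  # Windows path handling; works on any platform

# A raw literal and an escaped literal produce the same str object value.
literal = r"m:\music\Aerosmith\Classics Live!"
escaped = "m:\\music\\Aerosmith\\Classics Live!"
same = literal == escaped

# Hypothetical values standing in for what os.walk would yield.
subdir = r"C:\reports"
filename = "summary.docx"

# Build the full path, then split it back into folder and file name.
current_filepath = ntpath.join(subdir, filename)
folder = ntpath.dirname(current_filepath)
name = ntpath.basename(current_filepath)

print(same, current_filepath, folder, name)
```

A likely cause of the failure, stated as an assumption: Shell.NameSpace expects a folder, while current_filepath points at a file. Passing the folder (subdir) to sh.NameSpace and then looking the file up with ns.ParseName(name) may be what is needed.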

Editing several Excel files while iterating through a folder path - python

I'm working on editing several Excel files at once and when it comes time to iterate through all of my folders, it is only capable of doing so for the first .xlsx file.
def sumOfCosts():
    path=os.chdir(r'C:\Users\salvarai\AppData\Roaming\Python\Python310\site-packages\COSTBOMS')
    for file in os.listdir(path):
        if file.endswith(".xlsx"):
            wb=load_workbook(filename=file)
            sheet=wb.active()
            sheet['08'].value="Total Cost="
            char=get_column_letter(8)
            sumchar=get_column_letter(16)
            sheet[sumchar+"8"]=F"=SUM({char+'2'}:{char +'1000'})"
            wb.save(file)
            wb.close()
        return
Your return statement is at the same indent level as 'if file.endswith(".xlsx"):', so it sits inside the for loop. Whether or not the body of the if executes, the function returns during the first iteration, after looking at only the first 'file'.
To keep iterating through the files, move the 'return' so that it is level with the line 'for file in os.listdir(path)'. The function then does not return until every 'file' has been processed.
def sumOfCosts():
    path = os.chdir(r'C:\Users\salvarai\AppData\Roaming\Python\Python310\site-packages\COSTBOMS')
    for file in os.listdir(path):
        if file.endswith(".xlsx"):
            wb = load_workbook(filename=file)
            sheet = wb.active()
            sheet['08'].value = "Total Cost ="
            char = get_column_letter(8)
            sumchar = get_column_letter(16)
            sheet[sumchar + "8"] = F"=SUM({char+'2'}:{char +'1000'})"
            wb.save(file)
            wb.close()
    return # <--- The return should be at this indent level
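The effect of the return placement can be reproduced without any Excel files. The following minimal sketch uses two hypothetical functions over a made-up list of file names; the only difference between them is the indent level of the return.

```python
def process_first_only(names):
    done = []
    for name in names:
        if name.endswith(".xlsx"):
            done.append(name)
        return done  # inside the loop: exits during the first iteration

def process_all(names):
    done = []
    for name in names:
        if name.endswith(".xlsx"):
            done.append(name)
    return done  # after the loop: every name is visited

files = ["a.xlsx", "notes.txt", "b.xlsx"]
print(process_first_only(files))  # ['a.xlsx']
print(process_all(files))         # ['a.xlsx', 'b.xlsx']
```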

Create folder after every execution with different names

I'm trying to make a program which creates a new folder with different names inside a folder for every execution being made. I've pasted the code below I'm using:
import os
current_directory = os.getcwd()
name = "Day 1"
def folder_create(path, folder_name):
    folder_names = [folder for folder in os.listdir(
        path) if os.path.isdir(folder)]
    if folder_name not in folder_names:
        os.makedirs(folder_name)
    else:
        folder_num = folder_name.split(' ')[1]
        new_folder_name = f'Day {int(folder_num) + 1}'
        os.makedirs(new_folder_name, exist_ok=False)
        folder_name = new_folder_name
    return folder_name
if __name__ == '__main__':
    name = folder_create(current_directory, name)
    print(name)
This code only works twice: it creates two folders (Day 1 and Day 2) over the first two executions, but after that it raises FileExistsError. Please help me find a way around this, as I just want it to create a new folder named for the next day (Day 1, Day 2, Day 3, ...) each time it is executed.
This will work nicely. The real issue with your code was that you kept supplying the same "name" argument to the folder_create() function.
All I've done is remove the need for supplying the name argument.
What it now does instead is get the list of directories, sort the list, then take the last one using negative indexing. It then uses that last name to create the new folder name. Have fun.
import os
current_directory = os.getcwd()
name = "Day 1"
def folder_create(path):
    folder_names = [folder for folder in os.listdir(
        path) if os.path.isdir(folder)]
    folder_names.sort()
    if "Day 1" not in folder_names:
        # assign the name here so the return below is defined on the first run
        folder_name = 'Day 1'
        os.makedirs(folder_name)
    else:
        folder_num = folder_names[-1].split(' ')[1]
        new_folder_name = f'Day {int(folder_num) + 1}'
        os.makedirs(new_folder_name, exist_ok=False)
        folder_name = new_folder_name
    return folder_name
if __name__ == '__main__':
    name = folder_create(current_directory)
    print(name)
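Two caveats about the approach above may matter in practice. Sorting the names as strings puts "Day 10" before "Day 2", so the last element stops being the newest folder once ten folders exist. Also, os.path.isdir(folder) is evaluated relative to the current working directory rather than to path. A minimal sketch of a variant that compares the numbers instead, assuming folders are named exactly "Day N" (the function name next_day_folder is hypothetical):

```python
import os
import re
import tempfile

def next_day_folder(path):
    """Create the next 'Day N' folder inside `path` and return its name."""
    pattern = re.compile(r"Day (\d+)")
    numbers = []
    for entry in os.listdir(path):
        match = pattern.fullmatch(entry)
        # Join with `path` so the check does not depend on the working directory.
        if match and os.path.isdir(os.path.join(path, entry)):
            numbers.append(int(match.group(1)))
    # Compare as integers; default=0 covers the case where no folder exists yet.
    new_name = f"Day {max(numbers, default=0) + 1}"
    os.makedirs(os.path.join(path, new_name))
    return new_name

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    created = [next_day_folder(base) for _ in range(11)]

print(created[0], created[-1])
```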
This solution might work for you:
import os
import random
current_directory = os.getcwd()
day_names = []
for i in range(0, 10+1): # use 10 or another number, plus 1, to generate that many day names
    day_names.append('Day '+str(i))
def folder_create(path, folder_name):
    folder_names = [folder for folder in os.listdir(
        path) if os.path.isdir(folder)]
    if folder_name not in folder_names:
        os.makedirs(folder_name)
    else:
        folder_num = folder_name.split(' ')[1]
        new_folder_name = f'Day {int(folder_num) + random.randint(0, 10000)}' # pick a random day number if the name already exists
        os.makedirs(new_folder_name, exist_ok=False)
        folder_name = new_folder_name
    return folder_name
if __name__ == '__main__':
    for name in day_names: # loop over each day name
        name = folder_create(current_directory, name)
        print(name)

Is it possible to create a python script that looks for files in a directory on a given time daily?

So basically, I'm creating a directory that users can put CSV files into. I want to create a Python script that looks in that folder every day at a given time (let's say noon) and picks up the latest file that was placed there, provided it is not more than a day old. But I'm not sure if that's possible.
It's this chunk of code that I would like to run if the app finds a new file in the desired directory:
def better_Match(results, best_percent = "Will need to get the match %"):
    result = {}
    result_list = [{item.name:item.text for item in result} for result in results]
    if result_list:
        score_list = [float(item['score']) for item in result_list]
        match_index = max(enumerate(score_list),key=lambda x: x[1])[0]
        logger.debug('MRCs:{}, Chosen MRC:{}'.format(score_list,score_list[match_index]))
        logger.debug(result_list[match_index])
        above_threshold = float(result_list[match_index]['score']) >= float(best_percent)
        if above_threshold:
            result = result_list[match_index]
    return result
def clean_plate_code(platecode):
    return str(platecode).lstrip('0').zfill(5)[:5]
def re_ch(file_path, orig_data, return_columns = ['ex_opbin']):
    list_of_chunk_files = list(file_path.glob('*.csv'))
    cb_ch = [pd.read_csv(f, sep=None, dtype=object, engine='python') for f in tqdm(list_of_chunk_files, desc='Combining ch', unit='chunk')]
    cb_ch = pd.concat(cb_ch)
    shared_columns = [column_name.replace('req_','') for column_name in cb_ch.columns if column_name.startswith('req_')]
    cb_ch.columns = cb_ch.columns.str.replace("req_", "")
    return_columns = return_columns + shared_columns
    cb_ch = cb_ch[return_columns]
    for column in shared_columns:
        cb_ch[column] = cb_ch[column].astype(str)
        orig_data[column] = orig_data[column].astype(str)
    final= orig_data.merge(cb_ch, how='left', on=shared_columns)
    return final
To run the script at a certain time:
On Linux you can use cron.
On Windows you can use Task Scheduler.
Here is an example for getting the latest file in a directory:
files = os.listdir(output_folder)
files = [os.path.join(output_folder, file) for file in files]
files = [file for file in files if os.path.isfile(file)]
latest_file = max(files, key=os.path.getctime)
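The snippet above can be extended to cover the "not over a day old" requirement from the question. The following is a minimal sketch, not a definitive implementation: the function name latest_recent_csv and its parameters are hypothetical, and it uses the modification time (st_mtime) rather than getctime, because on Unix the ctime is the time of the last metadata change rather than the creation time.

```python
import os
import tempfile
import time
from pathlib import Path

def latest_recent_csv(folder, max_age_seconds=86400, now=None):
    """Return the newest .csv in `folder` modified within the age limit, else None."""
    now = time.time() if now is None else now
    candidates = [p for p in Path(folder).glob("*.csv") if p.is_file()]
    recent = [p for p in candidates if now - p.stat().st_mtime <= max_age_seconds]
    return max(recent, key=lambda p: p.stat().st_mtime, default=None)

# Demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    old = Path(folder, "old.csv")
    new = Path(folder, "new.csv")
    old.write_text("a\n")
    new.write_text("b\n")
    now = time.time()
    os.utime(old, (now - 3 * 86400, now - 3 * 86400))  # three days old
    os.utime(new, (now - 3600, now - 3600))            # one hour old
    picked_name = latest_recent_csv(folder, now=now).name

    os.utime(new, (now - 2 * 86400, now - 2 * 86400))  # now both are too old
    none_found = latest_recent_csv(folder, now=now)

print(picked_name, none_found)
```

A scheduler such as cron or Task Scheduler would then run a script that calls this function once a day and processes the returned file when it is not None.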
This will do the job!
import os
import time
import threading
import pandas as pd
DIR_PATH = 'DIR_PATH_HERE'
def create_csv_file():
    # create a files.csv file that will contain all the current files
    # This will run one time only
    if not os.path.exists('files.csv'):
        list_of_files = os.listdir(DIR_PATH )
        list_of_files.append('files.csv')
        pd.DataFrame({'files':list_of_files}).to_csv('files.csv')
    else:
        None
def check_for_new_files():
    create_csv_file()
    files = pd.read_csv('files.csv')
    list_of_files = os.listdir(DIR_PATH )
    if len(files.files) != len(list_of_files):
        print('New file added')
        #do what you want
        #save your excel with the name sample.xlsx
        #append your excel into list of files and get the set so you will not have the sample.xlsx twice if run again
        list_of_files.append('sample.xlsx')
        list_of_files=list(set(list_of_files))
        #save again the current list of files
        pd.DataFrame({'files':list_of_files}).to_csv('files.csv')
    print('Finished for the day!')
ticker = threading.Event()
# Run the program every 86400 seconds = 24h
while not ticker.wait(86400):
    check_for_new_files()
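It uses threading.Event().wait(86400) as a timer, so the check runs every 86400 s, which is 24 h. On the first run it saves the names of all current files to files.csv in the directory the script is run from. On later runs it compares the number of files in the directory with the number recorded in files.csv and, when they differ, treats that as a new file and writes the updated list back to files.csv.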
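Comparing only the number of files can miss a change, for example when one file is removed and another is added on the same day. A minimal sketch of an alternative that compares the names themselves with a set difference follows. The function name find_new_files and the JSON state file are assumptions for illustration, not part of the answer above, and unlike the answer this version reports every file as new on the very first run.

```python
import json
import os
import tempfile

def find_new_files(directory, state_path):
    """Return names in `directory` that were not recorded on the previous run."""
    try:
        with open(state_path) as f:
            seen = set(json.load(f))
    except FileNotFoundError:
        seen = set()  # first run: nothing recorded yet
    current = set(os.listdir(directory))
    # Record the current listing for the next run.
    with open(state_path, "w") as f:
        json.dump(sorted(current), f)
    return sorted(current - seen)

# Demonstration with temporary directories and made-up file names.
with tempfile.TemporaryDirectory() as watched, tempfile.TemporaryDirectory() as state_dir:
    state = os.path.join(state_dir, "seen.json")
    open(os.path.join(watched, "a.csv"), "w").close()
    first = find_new_files(watched, state)
    open(os.path.join(watched, "b.csv"), "w").close()
    second = find_new_files(watched, state)
    third = find_new_files(watched, state)

print(first, second, third)
```

Keeping the state file outside the watched directory avoids having to add it to the list by hand, as the answer does with files.csv.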
