Renaming and Saving Excel Files With Python

I've got a pretty simple task, but I haven't done much with Excel in Python and I'm not sure how to go about doing this.
What I need to do:
Look at many Excel files within subfolders, rename them according to information within the file, and store them all in one folder somewhere else.
The data is structured like this:
Main Folder
    Subfolder1
        File1
        File2
        File3
    ...
For about a hundred subfolders and several files within each subfolder.
From here, I want to pull the company name, part number, and date from within each file and use those to rename the Excel file. I'm not sure how to rename the file.
Then I need to save it somewhere else. I'm having trouble finding all these functions; any advice?

Check the os and os.path modules for listing folder contents (walk, listdir) and working with path names (abspath, basename, etc.).
Also, shutil has some interesting functions for copying stuff. Check out copyfile and specify the dst parameter based on the data you read from the Excel file.
This page can help you get at the Excel data: http://www.python-excel.org/
You probably want to have some high-level code like this:
import os
import shutil

for subfolder_name in os.listdir(MAIN_FOLDER):
    # exercise left to reader: filter out non-folders
    subfolder_path = os.path.join(MAIN_FOLDER, subfolder_name)
    for excel_file_name in os.listdir(subfolder_path):
        # exercise left to reader: filter out non-excel-files
        excel_file_path = os.path.join(subfolder_path, excel_file_name)
        new_excel_file_name = extract_filename_from_excel_file(excel_file_path)
        new_excel_file_path = os.path.join(NEW_MAIN_FOLDER, subfolder_name,
                                           new_excel_file_name)
        # make sure the destination folder exists before copying
        os.makedirs(os.path.dirname(new_excel_file_path), exist_ok=True)
        shutil.copyfile(excel_file_path, new_excel_file_path)
You'll have to provide extract_filename_from_excel_file yourself using the xlrd module from the site I mentioned.
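A minimal sketch of what extract_filename_from_excel_file could look like with xlrd. The cell positions for the company name, part number, and date are assumptions here, so adjust them to wherever those values actually live in your files:

import xlrd

def extract_filename_from_excel_file(excel_file_path):
    book = xlrd.open_workbook(excel_file_path)
    sheet = book.sheet_by_index(0)
    # Assumed cell positions (row, col are zero-based) -- adjust to your layout
    company = sheet.cell_value(0, 1)      # e.g. B1
    part_number = sheet.cell_value(1, 1)  # e.g. B2
    date = sheet.cell_value(2, 1)         # e.g. B3
    # Note: xlrd returns Excel dates as floats; convert with
    # xlrd.xldate_as_tuple(date, book.datemode) if needed.
    return "{}_{}_{}.xls".format(company, part_number, date)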

Related

Changing filenames in folders

I have a folder that contains a lot of files with a lot of duplicate copies, whose names make them unopenable.
Example:
cow.txt
cow.txt(1)
cow.txt(2)
cow.txt(3)
dog.txt
dog.txt(1)
I would like to have all the files renamed in a way that makes them able to be opened. Example:
cow.txt
cow(1).txt
cow(2).txt
cow(3).txt
dog.txt
dog(1).txt
Any help you can provide would be greatly appreciated. I am just looking to change the file names; I am not looking to read each individual file. In addition, if possible, I would like to break the files up into blocks of 20k. Thank you in advance.
I have tried using os.rename to simply rename the files, but I am confused about how to do this efficiently since the numbers come after the .txt. I then decided to read all the file names and convert them to a pandas data frame to fix them that way. However, I am confused about how to pull the files and save them under the new names.
list_of_files = os.listdir()
df = pd.DataFrame(list_of_files, columns = ['File_Name'])
df['.txt_removed'] = df.replace(to_replace = '.txt', value = '', regex = True)
df['txt_add'] = df['.txt_removed'] + '.txt'
To pull the files I would do something like this:
for filewant in df['txt_add']:
    if filewant in os.listdir():
        shutil.copy(filewant, 'new location')
I do not think this option will work, even though it gives me my intended result, because I would like to change the actual file names.
You can use Python's standard library: the os module has the os.rename function.
It works like this:
os.rename('cow.txt(1)', 'cow(1).txt')
Create a .py file, paste the code below, and run it. Replace the /mydir path with the path to the directory containing the files. The code loops through the directory, finds every file that has .txt somewhere in its name, and moves the trailing copy number in front of the .txt extension. I hope it works.
import glob, os, re

os.chdir("/mydir")
for file in glob.glob("*.txt*"):
    # move a trailing copy marker in front of the extension:
    # "cow.txt(1)" -> "cow(1).txt"
    match = re.match(r"(.*)\.txt\((\d+)\)$", file)
    if match:
        new_name = match.group(1) + "(" + match.group(2) + ").txt"
        os.rename(file, new_name)

How to extract files from across multiple folders in Python

Essentially... I have 27 folders on a local drive, each comprising tens of thousands of .jpgs. I have generated a random list of 5,000 of these images, distributed across the 27 folders, that I wish to transfer into one folder.
I have a .csv list of all the 5,000 filenames I need, and was wondering what the easiest way to get these all into one folder would be?
I see plenty of online resources explaining how to achieve this using 'glob.glob' and 'os.walk' to extract all files with a specific file extension (e.g. .txt), however I need to extract them based on the specific filenames that I have in a list.
I generally work in Python, but if there's another obvious non-coding way of doing this without it that I'm missing, please do suggest that too.
Thanks, R
This is a relatively basic Python solution, but maybe you can tweak it to something more useful.
import csv
import os
import shutil

# Generator for all files in root_dir
def list_files(my_root_dir):
    for path, _, files in os.walk(my_root_dir):
        yield path, files

def copy_files(my_csv_file, my_root_dir, my_target_dir):
    with open(my_csv_file, "r") as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=",")
        for line in csv_reader:
            for image in line:
                for path, files in list_files(my_root_dir):
                    # Careful not to copy files already copied to target_dir
                    if image in files and path != my_target_dir:
                        shutil.copy(os.path.join(path, image), my_target_dir)
                        break
If you want to stick to the shell, there are some ideas here: https://unix.stackexchange.com/questions/402728/how-to-move-files-specified-in-a-text-file-to-another-directory-on-bash.
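A side note on the Python solution above: it re-walks the whole directory tree once per image name, which can get slow with tens of thousands of files. A sketch of a faster variant (same assumed CSV layout, hypothetical function name) walks the tree once and builds a filename-to-path index first:

import csv
import os
import shutil

def copy_files_indexed(my_csv_file, my_root_dir, my_target_dir):
    # Walk the tree once, remembering the first directory holding each filename
    index = {}
    for path, _, files in os.walk(my_root_dir):
        for name in files:
            index.setdefault(name, path)
    with open(my_csv_file, "r") as csvfile:
        for line in csv.reader(csvfile):
            for image in line:
                if image in index and index[image] != my_target_dir:
                    shutil.copy(os.path.join(index[image], image), my_target_dir)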

AttributeError: 'DataFrame' object has no attribute 'path'

I'm incrementally building a financial statement database. The first steps center around collecting 10-Ks from the SEC's EDGAR database. I have code for pulling the relevant 8-Ks, 10-Ks, and 10-Qs by CIK number and accession number, and for retrieving the relevant Excel spreadsheet. The code below tries to create a folder within a directory, name the folder with the CIK code, pull the spreadsheet from the EDGAR database, and save the spreadsheet to the folder with the CIK code. My example is a csv file I'm calling "accessionnumtest.csv", which has the headings:
company_name,report_type,cik,date,cik_accession
and data:
4Less Group, Inc.,10K/A,1438901,11/27/2019,edgar/data/1438901/000121390019024801.txt
AB INTERNATIONAL GROUP CORP.,10K,1605331,10/22/2019,edgar/data/1605331/000166357719000384.txt
ABM INDUSTRIES INC /DE/,10K,771497,12/20/2019,edgar/data/771497/000162828019015259.txt
ACTUANT CORP,10K,6955,10/29/2019,edgar/data/6955/000000695519000033.txt
My code is below:
import os
import pandas as pd

path = os.getcwd()
folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")

folder_name = df['cik']
print(folder_name)

for row in df.iterrows():
    dir = df.path.join(folder_path, folder_name)
    os.makedirs(dir)
This code is giving me an AttributeError: 'DataFrame' object has no attribute 'path'. I have renamed the path and checked for whitespace in the headers. Any suggestions are appreciated.
Regarding the error: it's os.path.join, not df.path.join. You are calling join on the wrong object; path.join lives in the os module.
That being said, your code is not doing what you are trying to do regardless of the error. folder_name will not update for each row. You could use row.cik to get the value for each row from iterrows():
dir = os.path.join(folder_path, row.cik)
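One detail to watch: df.iterrows() yields (index, row) pairs, and pandas will read the cik column in as integers, so a working version of the loop would look something like this sketch (assuming the CSV from the question):

import os
import pandas as pd

folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")

for idx, row in df.iterrows():
    # cik is parsed as an integer, so convert it to str for the path
    dir_path = os.path.join(folder_path, str(row['cik']))
    os.makedirs(dir_path, exist_ok=True)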
It is relatively unclear what you're working towards accomplishing, particularly with .csv files and Pandas. The code you have seems to have a lot of curious errors in it, which I think might be ameliorated by going back to learn some of the simpler Python concepts before trying something as difficult as web scraping. Note I don't mean you should give up; rather, building up the fundamentals is a necessary step in this type of project.
That said, if I'm understanding your intent correctly, you want to create a file hierarchy for 10-K, 10-Q, etc. filings for several CIKs.
There shouldn't be any need to use .csv files, or pandas for this.
Probably the simplest way to do this would be to do it in the same step you download them.
Pseudocode for this would be as follows:
for cik in list_of_ciks:
    first_file = find_first_file_online()
    if first_file is 10-K:
        save to 10-K folder for CIK
    if first_file is 10-Q:
        save to 10-Q folder for CIK
As I said above, you can skip the .csv file. (Also, note that CSV stands for "comma-separated values." Some of the entries in your data contain commas, e.g. "4Less Group, Inc." Unless such a field is quoted, the comma will split the single entry into two columns, shifting all of your data over by one column.)
When you process the data, you'll want to build the folders as you go.
When you iterate through a new CIK, create the master folder for that CIK. When you encounter a 10-K, create a folder for 10-K's and save it with a unique name. Since you need to use the accession numbers to get the excel sheets, that wouldn't be a bad naming convention to follow.
It would be something like this:
import requests
import pathlib

cik_list = [cik_1, cik_2, ..., cik_n]  # placeholder CIK values

for cik in cik_list:
    # placeholder URL -- build the real one from your cik/accession data
    response = requests.get("cik/accession/Report.xlsx")
    file_path = pathlib.Path(cik, report_type, accession_number + ".xlsx")
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with open(file_path, "wb") as excel_file:
        excel_file.write(response.content)
The above code will definitely not run as written, and does not include everything you would need to make it work, since that information has to come from you. Integrating the above concepts into your code is up to you.
To reiterate, you have the CIK, the accession number, and the report type. To save the files in folders, you need only create the folders as you go, with the form "CIK/report_type/accession.xlsx"

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk. In the zip docs for Python there is ZipFile.open(), which gives you a file-like object: an object that mostly behaves like a regular file on disk, but is in memory. It gives bytes when read, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.filelist)
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            mc = myfile.read()
            # decode the bytes, then wrap them in a file-like object for csv
            c = csv.reader(io.StringIO(mc.decode()))
            for row in c:
                print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
The csv module expects text rather than bytes, hence the extra decoding step via io.StringIO.
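If the CSV files are large, one alternative worth knowing (a sketch, assuming UTF-8 encoded data) is io.TextIOWrapper, which decodes the zip member on the fly instead of reading it fully into memory first:

from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            # TextIOWrapper decodes the binary stream incrementally
            for row in csv.reader(io.TextIOWrapper(myfile, encoding='utf-8')):
                print(row)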

Python - Navigating through Subdirectories that Meet Naming Criteria

I am using Python 3.5 to analyze data contained in csv files. These files are contained in a "figs" directory, which is contained in a case directory, which is contained in an overall data directory, e.g.:
/strm1/serino/DATA/06052009/figs
Or more generally:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs
The directory I am starting in is '/strm1/serino/DATA/,' and each subdirectory is the month, day, and year of a case I am working with. Each subdirectory contains another subdirectory named 'figs,' and that is the location of each case's csv file. To be exact:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs/case_date_in_MMDDYYYY.csv
So, I would like to start in my DATA directory and go through its subdirectories to find those that have the MMDDYYYY naming. However, some of the case directories may be named with a state abbreviation at the end, like: '06052009_TX.' Therefore, instead of matching the MMDDYYYY naming exactly, it could be something as simple as verifying that the directory name contains any number 1 through 9.
Once I am in the first subdirectory (the case directory) I would like to move into the 'figs' subdirectory. Once there, I want to access the csv file with the same naming convention as the first subdirectory (the case directory). I will fill existing arrays with the data contained in each csv file.
Basically, my question concerns navigating through multiple subdirectories that match a certain naming convention and ultimately accessing the data file at the "end." I was naively playing around with glob, fnmatch, os.listdir, and os.walk, but I could not get anything close enough to working that I feel would be helpful to include. I am not very familiar with those modules. What I can include is what I am going for:
for dirs in data_dir that contain a number:
    go into this directory
    go into 'figs' directory
    read data from the csv file whose name matches its case directory name
    (or whose name format matches the case directory name format)
I have come across related questions, but I have not been able to apply their answers in the way that I would like, especially with nested directories. I really appreciate the help, and let me know if I need to clarify anything.
The following should get you going. It uses the datetime.strptime() function to attempt to convert each folder name into a valid datetime object. If the conversion fails, then you know that the folder name is not in the correct format and can be skipped. It then attempts to parse any CSV file found in the corresponding figs folder:
from datetime import datetime
import glob
import csv
import os

dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))

for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.strptime(dirname[:8], '%m%d%Y')
            print(dt, dirname)
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file in glob.glob(os.path.join(csv_folder, 'figs', '*.csv')):
                with open(csv_file, newline='') as f_input:
                    csv_input = csv.reader(f_input)
                    for row in csv_input:
                        print(row)
        except ValueError:
            pass
You listed several problems above. Which one are you stuck on? It seems like you already know how to navigate the file storage system using os.path. You may not know of the function os.path.join(), which allows you to build a file path piece by piece, as such:
os.path.abspath(os.path.join(os.path.dirname(__file__), '../..', 'Data/TrailShelters/'))
To break down the above:
os.path.dirname(__file__) returns the path of the current file. '../..' means: go up two levels in the folder hierarchy. And Data/TrailShelters/ is the directory I wish to navigate to.
How does this apply to your particular case? You will need to make some adaptations, but you can store the path of the parent directory in a variable and then iterate through its subdirectories. For every subdirectory, examine its path and extract the particular part you are interested in. Then you can use something like if 'TN' in subdirectory_name to determine whether it is a subdirectory you care about. If so, update the saved parent path by appending the path to the subdirectory. Does that make any sense? A minimal sketch of the idea is below.
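A sketch under the question's stated layout: directory names starting with eight digits, optionally followed by a state suffix like _TX, each containing a figs folder with a CSV named after the case directory. The regex and the print placeholder are illustrative assumptions:

import csv
import os
import re

data_dir = '/strm1/serino/DATA'

for name in os.listdir(data_dir):
    # accept names like "06052009" or "06052009_TX"
    if not re.match(r'\d{8}(_[A-Z]{2})?$', name):
        continue
    csv_path = os.path.join(data_dir, name, 'figs', name + '.csv')
    if os.path.isfile(csv_path):
        with open(csv_path, newline='') as f:
            for row in csv.reader(f):
                print(row)  # fill your existing arrays here instead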
