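import openpyxl

file = 'excel.xlsm'
wb = openpyxl.load_workbook(filename=file, read_only=False, keep_vba=True)
sheet = wb['Template']
row = ['x', 'y', 'z']
rows = sheet.max_row
# Write each value into the first empty row, one column per value
for j in range(len(row)):
    sheet.cell(row=rows + 1, column=j + 1).value = row[j]
wb.save(file)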
I have an .xlsm file that I have tried to edit using openpyxl, but there is a problem. When I try to edit an .xlsm file that has 4 or 5 templates, the file is 4-5 MB in size and takes a long time to load with openpyxl. Is there any way to modify my current file faster without having to create a new one?
An .xlsx or .xlsm file is basically a bunch of XML files wrapped up in a zip file.
In my scripts repo I have an example (unlock-excel.py) where I use the zipfile module to open and modify an Excel file (in this case removing the <sheetProtect> and <workbookProtect> elements from the pages).
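As a rough illustration of that zipfile approach (not the actual unlock-excel.py code), the sketch below copies an archive member by member and passes one member through a caller-supplied function. The name rewrite_member and its parameters are made up for this example. Note that zipfile cannot edit a member in place, so this writes a second archive.

```python
import zipfile


def rewrite_member(src_path, dst_path, member, transform):
    """Copy the archive at src_path to dst_path, changing one member.

    transform is a caller-supplied function (hypothetical here) that takes
    the member's bytes and returns the new bytes.
    """
    with zipfile.ZipFile(src_path, "r") as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                data = transform(data)
            # Reuse the original ZipInfo so names and timestamps carry over
            dst.writestr(info, data)
```

Because only raw bytes are copied, this avoids parsing every worksheet, which is where a full load of the workbook tends to spend its time.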
What I learned about the excel file format for creating this program is documented in this article on my website. The highlights:
An xlsx (or xlsm) file is basically a zip-file with a standard directory structure and a gaggle of XML files. When unpacking these files, I generally found the following directories.
> mkdir foo
> cd foo
> unzip ../foo.xlsx
> find . -type d|sort
.
./_rels
./docProps
./xl
./xl/_rels
./xl/printerSettings
./xl/theme
./xl/worksheets
./xl/worksheets/_rels
The _rels directories don't matter for this purpose.
The docProps directory contains two files: app.xml and core.xml. The app.xml file basically contains a list of the titles as seen on the tabs at the bottom of the worksheets. The titles are listed in this file in the sequence in which they appear in the xlsx file, from left to right. They are bracketed between <vt:lpstr> and </vt:lpstr> tags.
The workbook.xml file in the xl directory contains a number of sheet definitions. These link the name of the worksheet to several numbers. Each sheet has a single tag with attributes, like this.
<sheet name="template" sheetId="4" r:id="rId1"/>
In the subdirectory xl/worksheets there are a number of XML files named sheetN.xml, where N is a number. These are the actual worksheets. One might expect that N corresponds to the sheetId, but that turns out not to be the case. The sheet number N is actually the number in the r:id attribute after the rId text. So in the example above, the worksheet named template is xl/worksheets/sheet1.xml.
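The match between the rId number and the file number may not hold for every workbook. A more general route, as far as I know, is to look the r:id up in xl/_rels/workbook.xml.rels, which maps each relationship id to a target path. The sketch below does that with the standard library only. The function name sheet_part_name is made up for this example, and the namespace URIs assume the common (transitional) Office Open XML format.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_part_name(xlsx, sheet_name):
    """Return the archive path of the worksheet XML for sheet_name, or None."""
    with zipfile.ZipFile(xlsx) as z:
        workbook = ET.fromstring(z.read("xl/workbook.xml"))
        rels = ET.fromstring(z.read("xl/_rels/workbook.xml.rels"))
    # Map relationship id -> target path as written in the .rels file
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter("{%s}Relationship" % NS_PKG_REL)
    }
    for sheet in workbook.iter("{%s}sheet" % NS_MAIN):
        if sheet.get("name") == sheet_name:
            target = targets.get(sheet.get("{%s}id" % NS_DOC_REL))
            if target is None:
                return None
            if target.startswith("/"):
                # Absolute target: already relative to the archive root
                return target.lstrip("/")
            # Relative target: resolved against the xl/ directory
            return posixpath.normpath(posixpath.join("xl", target))
    return None
```

For the <sheet> element shown earlier, a call such as sheet_part_name("foo.xlsx", "template") would return whatever path the rId1 relationship points at.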
I am working on a project that manipulates a document's XML file. My approach is as follows: first convert the DOCX document into a ZIP archive, then extract the contents of that archive to get access to the document.xml file, and finally convert the XML to a .txt file in order to work with it.
I did all of the above on one document and everything worked perfectly. But when I decided to use a different document, the zipfile library did not extract the contents of the new ZIP archive. Instead it somehow extracted the contents of the old document that I had processed before, and converted document.xml into document.txt without me even running the block of code that converts the XML into .txt.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the contents of that particular document when it is not even in the path.
This is the code I am using in Jupyter notebook.
import os
import shutil
import zipfile
# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.txt
The second document's directory tree should be as follows.
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.xml
But zipfile is extracting the contents of the first document even when that DOCX file is not in the directory. I am also using Ubuntu 20.04, so I am not sure whether it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
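A minimal sketch of that suggestion, assuming only the standard library: it resolves the DOCX path to an absolute one, extracts into a fresh temporary directory, and reads word/document.xml directly. The function name read_document_xml is made up for this example. Since a DOCX file is already a ZIP archive, the copy and rename steps are skipped here.

```python
import tempfile
import zipfile
from pathlib import Path


def read_document_xml(docx_path):
    """Return word/document.xml from a DOCX file as text."""
    # Absolute path, so the result does not depend on the notebook's working directory
    docx_path = Path(docx_path).resolve()
    # The temporary directory is removed when the with-block exits,
    # so no files from an earlier document can linger
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(docx_path) as archive:
            archive.extractall(tmp)
        xml_path = Path(tmp) / "word" / "document.xml"
        return xml_path.read_text(encoding="utf-8")
```

Usage would look like data = read_document_xml('data/docx/input.docx'), with a new extraction directory for every call.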
I have many directories from which I need to find and copy files, based on a list that I provide in CSV format. The CSV file would look something like this:
File_A.log,RED
File_B.log,BLUE
File_C.log,RED
File_D.log,RED
File_E.log,BLUE
File_A.log, File_B.log, File_C.log and File_D.log are the file names that I will be searching for in the many directories, which have a standard structure of "/XXX/YYY/ZZZ/".
RED and BLUE are the folders to which I want to copy the found files, i.e. /BLUE and /RED.
How can I achieve this in the most efficient way using either PowerShell or Python?
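One possible Python approach, offered as a sketch rather than a definitive answer: load the CSV into a dictionary and walk the directory tree once, copying each listed file as it is found. The function name copy_listed_files and its parameters are made up for this example, and it assumes the destination root lies outside the tree being searched.

```python
import csv
import os
import shutil


def copy_listed_files(csv_path, search_root, dest_root):
    """Copy each file named in the CSV from under search_root into dest_root/<folder>."""
    # Map file name -> destination folder name, e.g. "File_A.log" -> "RED"
    with open(csv_path, newline="") as handle:
        wanted = {
            row[0].strip(): row[1].strip()
            for row in csv.reader(handle)
            if len(row) >= 2
        }
    copied = []
    # A single pass over the tree instead of one search per listed file
    for folder, _dirs, files in os.walk(search_root):
        for name in files:
            if name in wanted:
                target_dir = os.path.join(dest_root, wanted[name])
                os.makedirs(target_dir, exist_ok=True)
                shutil.copy2(os.path.join(folder, name), target_dir)
                copied.append(name)
    return copied
```

If the same file name exists in several directories, later copies overwrite earlier ones in this sketch.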
I am new to Python and have been given the task of combining Excel files. I have 100 folders; each folder has 2 subfolders, and each subfolder has 24 Excel files. I have to find the maximum and minimum values of each of the 24 files and concatenate those values with the parent Excel file (this has to be done for all 24 files). Then I have to concatenate all 24 files and write them to the first column of an Excel file. This should be repeated for all 100 folders, so that I finally get a single Excel file with 100 columns.
Presently I am doing this manually for every file; the overwriting has become complicated and time-consuming. Could someone please help me get away from that method?
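import pandas as pd

data12 = pd.read_excel(r'C:\Users\Videos\1.xlsx')
# Column-wise maximum and minimum, each as a one-row DataFrame
A = data12.max(numeric_only=True).to_frame().T
C = data12.min(numeric_only=True).to_frame().T
frame_data = [data12, A, C]
result = pd.concat(frame_data, ignore_index=True)
result.to_excel("output1.xlsx", sheet_name='modify_data', index=False)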
You can use Python's glob library to browse through all the files in all the folders. You just need to pass the name of the master folder, and then use one loop to read all the files one by one.
Link for reference: Python glob multiple filetypes
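A minimal sketch of that loop, assuming the files end in .xlsx and sit somewhere below the master folder. The function name collect_min_max is made up for this example, and read_values is a hypothetical caller-supplied function that returns the numbers from one file (for instance a small wrapper around pandas.read_excel).

```python
import glob
import os
from collections import defaultdict


def collect_min_max(master_folder, read_values, pattern="*.xlsx"):
    """Group (min, max) pairs by the top-level folder under master_folder.

    read_values is supplied by the caller (hypothetical here): it takes a
    file path and returns an iterable of numbers.
    """
    results = defaultdict(list)
    # "**" with recursive=True matches files at any depth below the master folder
    search = os.path.join(master_folder, "**", pattern)
    for path in sorted(glob.glob(search, recursive=True)):
        # The first path component below master_folder identifies the folder
        top = os.path.relpath(path, master_folder).split(os.sep)[0]
        values = list(read_values(path))
        results[top].append((min(values), max(values)))
    return dict(results)
```

Each key of the returned dictionary could then become one column of the final workbook.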
I am working on a web scraper tool which downloads Excel files from a website. Of course, those .xls files are actually just renamed .csv files, which prevents me from simply combining the .xls files. Instead, I need to convert them all to .csv, then use pyexcel's pyexcel.merge_csv_to_a_book(filelist, outfilename='merged.xls') function to create an Excel book from these .csv files.
Here is what I tried:
def concatenate_excel_files():
    indexer = 0
    excel_file_list = []
    for file in glob.glob(os.getcwd()+'\Reports\*.'):
        pyexcel.save_as(file_name=file, dest_file_name=str(indexer)+'.csv')
        excel_file_list[indexer] = file
        indexer += 1
    pyexcel.merge_csv_to_a_book(excel_file_list, outfilename='merged.xls')
This fails to even convert the files to .csv (IndexError: list index out of range error.)
Any help rewriting this would be appreciated.
Answer by chfw:
For pyexcel to work properly, it needs to know the file extension, but in your case the file extension is missing. It would also be more helpful if the full stack trace were shown.
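Since the downloads are described as CSV files under an .xls name, one option is to copy them to names that end in .csv, so that the extension matches the content. The sketch below does only that, with the standard library; copy_as_csv, its pattern argument and the output directory are made up for this example. It also collects the new paths with append, because assigning to excel_file_list[indexer] on an empty list raises IndexError by itself.

```python
import glob
import os
import shutil


def copy_as_csv(source_pattern, out_dir):
    """Copy every file matching source_pattern into out_dir under a numbered .csv name."""
    os.makedirs(out_dir, exist_ok=True)
    csv_files = []
    for index, path in enumerate(sorted(glob.glob(source_pattern))):
        dest = os.path.join(out_dir, "%d.csv" % index)
        shutil.copyfile(path, dest)
        # append grows the list; csv_files[index] = dest would fail on an empty list
        csv_files.append(dest)
    return csv_files
```

The returned list could then be passed to pyexcel.merge_csv_to_a_book as in the question.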
I've got a pretty simple task but I haven't done too many functions with excel within python and I'm not sure how to go about doing this.
What I need to do:
Look at many Excel files within subfolders, rename them according to information within each file, and store them all in one folder somewhere else.
The data is structured like this:
Main Folder
    Subfolder1
        File1
        File2
        File3
    ...
For about a hundred subfolders and several files within each subfolder.
From here, I want to pull the company name, part number, and date from within the file and use those to rename the excel file. Not sure how to rename the file.
Then save it somewhere else. I'm having trouble finding all these functions, any advice?
Check the os and os.path module for listing folder contents (walk, listdir) and working with path names (abspath, basename etc.)
Also, shutil has some interesting functions for copying stuff. Check out copyfile and specify the dst parameter based on the data you read from the excel file.
This page can help you get at the Excel data: http://www.python-excel.org/
You probably want to have some high-level code like this:
for subfolder_name in os.listdir(MAIN_FOLDER):
    # exercise left to reader: filter out non-folders
    subfolder_path = os.path.join(MAIN_FOLDER, subfolder_name)
    for excel_file_name in os.listdir(os.path.join(MAIN_FOLDER, subfolder_name)):
        # exercise left to reader: filter out non-excel-files
        excel_file_path = os.path.join(subfolder_path, excel_file_name)
        new_excel_file_name = extract_filename_from_excel_file(excel_file_path)
        new_excel_file_path = os.path.join(NEW_MAIN_FOLDER, subfolder_name,
                                           new_excel_file_name)
        shutil.copyfile(excel_file_path, new_excel_file_path)
You'll have to provide extract_filename_from_excel_file yourself using the xlrd module from the site I mentioned.
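Which cells hold the company name, part number and date depends on your files, so the xlrd part is left open here. As a sketch of the remaining step, the hypothetical helper below turns the three values into a file name that is safe to use in a path; extract_filename_from_excel_file could call it once the values have been read. The name build_excel_filename and the default extension are assumptions for this example.

```python
import re


def build_excel_filename(company, part_number, date_text, extension=".xls"):
    """Join the three fields into a file name, replacing characters that are unsafe in paths."""
    parts = [str(company), str(part_number), str(date_text)]
    # Keep letters, digits, dots and hyphens; collapse everything else to "_"
    cleaned = [
        re.sub(r"[^A-Za-z0-9.-]+", "_", part.strip()).strip("_")
        for part in parts
    ]
    return "_".join(cleaned) + extension
```

For example, build_excel_filename("Acme Corp", "PN-123", "2020/01/31") gives "Acme_Corp_PN-123_2020_01_31.xls".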