How to combine Excel files from different folders - Python

I am new to Python and have been given the task of combining Excel files. I have 100 folders; each folder has 2 subfolders, and each subfolder contains 24 Excel files. For each of the 24 files I have to find the maximum and minimum values and concatenate them onto the parent Excel file (this has to be done for all 24 files). Then I have to concatenate all 24 files and write the result into the first column of an Excel file. This should be repeated for all 100 folders, so that I finally get a single Excel file with 100 columns.
At present I am doing this manually for every file; files keep getting overwritten, and it has become complicated and time-consuming. Please can someone help me get away from that method? This is what I have for one file:
data12 = pd.read_excel(r'C:\Users\Videos\1.xlsx')
A = data12.max()  # column-wise maxima (plain max(data12) would only return a column label)
C = data12.min()  # column-wise minima
frame_data = [data12, A.to_frame().T, C.to_frame().T]
result = pd.concat(frame_data)
result.to_excel("output1.xlsx", sheet_name='modify_data', index=False)

You can use Python's glob module to walk through all the files in all the folders. You just need to pass it a pattern rooted at the master folder, and then use one loop to read the files one by one.
Link for reference: Python glob multiple filetypes
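As a minimal sketch of that idea (the folder layout below is a toy stand-in for the 100-folder structure, and the pandas step is left as a comment), one recursive glob pattern collects every workbook in a single pass:

```python
import glob
import os
import tempfile

# Build a toy version of the described layout: folders, each with
# subfolders containing .xlsx files (empty placeholders for the demo).
master = tempfile.mkdtemp()
for folder in ("folder1", "folder2"):
    for sub in ("subA", "subB"):
        d = os.path.join(master, folder, sub)
        os.makedirs(d)
        for i in range(2):
            open(os.path.join(d, f"{i}.xlsx"), "w").close()

# One recursive pattern replaces the per-file manual step.
paths = sorted(glob.glob(os.path.join(master, "**", "*.xlsx"), recursive=True))
print(len(paths))  # 2 folders * 2 subfolders * 2 files = 8

for path in paths:
    # df = pd.read_excel(path)  # then compute df.max()/df.min() and concat
    pass
```

From here the loop body is exactly the per-file code from the question, applied once per path instead of by hand.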


Extracting multiple Excel files from different paths in python at once while they are zipped in various places

I have 30 folders (from 2021-06-01 to 2021-06-30) and in each of them I have 15 Excel files. Currently I am running this code separately, 30 times:
file1 = glob.glob('C:/Users/Dell/Downloads/2021-06-01/*')
to make a file list file1 and run a data-processing operation on each single folder (going over its 15 Excel files). That way I get file1 to file30, and then I concat them to make one single file.
Is there any way to automate this process? I don't want to run this operation 30 times separately, and I can't figure out how to write a loop that extracts files from the different paths.
Also, the data is zipped inside the folders (from 2021-06-01 to 2021-06-30), so it is tedious to go in one by one, unzip each archive, and then run the operation one by one.
How could I achieve both purposes in an easier way? I have seen unzipping solutions from searching, but I can't use them while also covering the other purpose I mentioned (going through the different folders and extracting them one by one, producing file1 to file30 in one go).
My directories look like:
- download
  - month
    - 2021-01-01
      - AA
        - file.zip
        - a list of .xlsx files
      - BB
      - CC
    - 2021-01-02
      - AA
        - file.zip
        - a list of .xlsx files
      - BB
      - CC
    - ...
    - 2021-01-30
Now I don't want to simply concat these .xlsx files; I want to run a certain operation on each Excel file one by one and then concat the results. However, I have not been able to do so.
Here's a script in Python that should work for you:
import os
import shutil
import time

import pandas as pd

def read_csv_or_excel(root, f):
    if f.endswith(".csv"):
        df = pd.read_csv(f"{root}/{f}", sep="\t")
    elif f.endswith(".xlsx"):
        df = pd.read_excel(f"{root}/{f}")
    return df

for root, dirs, files in os.walk("./questions/69878352/"):
    # print(root, dirs, files)
    if root.split("/")[-1].startswith("20"):  # only the date-named folders
        print(root)
        appended = []
        dfs = []
        for f in files:
            if f[:-4] not in appended:  # skip a file whose stem was already handled
                print(f)
                if f.endswith(".csv") or f.endswith(".xlsx"):
                    dfs.append(read_csv_or_excel(root, f))
                elif f.endswith(".zip"):
                    shutil.unpack_archive(f"{root}/{f}", f"{root}/")
                    time.sleep(0.5)
                    f = f"{f[:-4]}.xlsx"  # ← this assumes any zipped files will be Excel files...
                    dfs.append(read_csv_or_excel(root, f))
                else:
                    continue
                appended.append(f[:-4])
        pd.concat(dfs).to_excel(f"{root}.xlsx")  # one combined workbook per date folder
Let me know if it doesn't work! My test data isn't the best, and I'd have to spend some more time making better test data to be 100% sure, so if you have any issues, it's probably just a little tweak that's necessary to amend it 🙂
You could also try just using bash in your terminal:
$ find . -maxdepth 5 -name '*.zip' | parallel unzip  # this will unzip everything in one command (quote the pattern so the shell doesn't expand it)
$ find . -maxdepth 5 -name '*.xlsx' | parallel <command>  # perform whatever operation you want on all the Excel files

Is there a faster way to edit an xlsm file other than openpyxl?

file = 'excel.xlsm'
wb = openpyxl.load_workbook(filename=file, read_only=False, keep_vba=True)
sheet = wb['Template']
rowx = ['x', 'y', 'z']
rows = sheet.max_row
for j in range(len(rowx)):
    sheet.cell(row=rows + 1, column=j + 1).value = rowx[j]
wb.save(file)
I have an .xlsm file and I have tried editing it using openpyxl, but there is a problem: when the file has 4 or 5 templates it is 4-5 MB in size, which takes a long time to load with openpyxl. Is there any way to modify the existing file faster, without having to create a new one?
An .xlsx or .xlsm file is basically a bunch of XML files wrapped up in a zip file.
In my scripts repo I have an example (unlock-excel.py) where I use the zipfile module to open and modify an Excel file. (In this case removing the <sheetProtection> and <workbookProtection> elements from the sheets.)
What I learned about the excel file format for creating this program is documented in this article on my website. The highlights:
An xlsx (or xlsm) file is basically a zip-file with a standard directory structure and a gaggle of XML files. When unpacking these files, I generally found the following directories.
> mkdir foo
> cd foo
> unzip ../foo.xlsx
> find . -type d|sort
.
./_rels
./docProps
./xl
./xl/_rels
./xl/printerSettings
./xl/theme
./xl/worksheets
./xl/worksheets/_rels
The _rels directories don't matter for this purpose.
The docProps directory contains two files: app.xml and core.xml. The app.xml file basically contained a list of the titles as seen on the tabs at the bottom of the worksheets. The titles are listed in this file in the sequence they appear in the xlsx file from left to right. They are bracketed between <vt:lpstr> and </vt:lpstr> tags.
The workbook.xml file in the xl directory contains a number of sheet definitions. These link the name of the worksheet to several numbers. Each sheet has a single tag with attributes, like this.
<sheet name="template" sheetId="4" r:id="rId1"/>
In the subdirectory xl/worksheets there are a number of XML files named sheetN.xml, where N is a number. These are the actual worksheets. One might expect that N corresponds with the sheetId, but that turns out not to be the case. The sheet number N actually is the number in the r:id attribute after the rId text. So in the example above, the worksheet named template is xl/worksheets/sheet1.xml.
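As a minimal sketch of reading those sheet definitions straight from the zip container (the workbook.xml content below is a hand-made stand-in, not a real Excel export), the standard zipfile and xml.etree modules are enough to map each sheet name to its worksheet file via the rId number:

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# A hand-made stand-in for the relevant part of xl/workbook.xml.
WORKBOOK_XML = (
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" '
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    '<sheets>'
    '<sheet name="template" sheetId="4" r:id="rId1"/>'
    '<sheet name="data" sheetId="2" r:id="rId2"/>'
    '</sheets></workbook>'
)

# Pack it into an in-memory zip, mimicking the xlsx/xlsm container.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", WORKBOOK_XML)

# Read it back the way you would from a real .xlsx/.xlsm file.
with zipfile.ZipFile(buf) as zf:
    root = ET.fromstring(zf.read("xl/workbook.xml"))

ns = {
    "m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
RID = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id"

# Sheet name → worksheet file, using the number after "rId", not the sheetId.
sheets = {
    s.get("name"): "xl/worksheets/sheet{}.xml".format(re.sub(r"\D", "", s.get(RID)))
    for s in root.findall("m:sheets/m:sheet", ns)
}
print(sheets["template"])  # xl/worksheets/sheet1.xml
```

Note how "template" resolves to sheet1.xml even though its sheetId is 4, matching the observation above.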

How can I read all files from a directory and do operations in parallel?

Suppose I have some files in a directory and I want to read each file and extract the file name and the first row of the file (i.e. the header) for some validation. How can we do this in Spark (using Python)?
input_file = sc.textFile(sourceFileDir)
With sc.textFile() we can read all files in parallel, and with map we can apply any rule or function to each element in the RDD. What I don't understand is how to fetch only the file name and one row of each file using sc.textFile().
Currently, I am meeting this requirement (mentioned above) using a for loop:
files = os.listdir(sourceFileDir)
for x in files:
    **operations**
How can I do the same in parallel for all files? That would save some time, as there are lots of files in the directory.
Thanks in advance.
textFile is not what you are looking for; you should use wholeTextFiles. It creates an RDD with the file name as key and the content as value. Then you apply a map to keep only the first line:
sc.wholeTextFiles(sourceFileDir).map(lambda x: (x[0], x[1].split('\n')[0]))
By doing that, the output of your map is the file name and the first line.
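Outside Spark, the same (file name, first line) extraction can be parallelized with the standard library's concurrent.futures. This is a plain-Python alternative sketch, not part of the Spark answer, and the toy files it reads are created by the snippet itself:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Toy directory with a few files, each starting with a header line.
source_dir = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(source_dir, name), "w") as fh:
        fh.write(f"header-of-{name}\nrow1\nrow2\n")

def name_and_header(path):
    # Read only the first line; cheap even for large files.
    with open(path) as fh:
        return os.path.basename(path), fh.readline().rstrip("\n")

paths = [os.path.join(source_dir, f) for f in sorted(os.listdir(source_dir))]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(name_and_header, paths))

print(results[0])  # ('a.txt', 'header-of-a.txt')
```

For a handful of local files this is usually simpler than spinning up Spark; wholeTextFiles pays off when the files live on a cluster filesystem.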

Python Iterating program

I am a beginner programmer working on my first project. I am trying to create a script that unzips two files and extracts a folder containing .csv files to a temp directory. I am hoping to import and format those .csv files with the XlsxWriter lib. My code is able to do the first part; it unzips all the .csv's perfectly.
I need some pointers on how to iterate over all the .csv's in the temp folder and copy the data of each .csv to an Excel spreadsheet. I must note that the .csv files have only one row with 5 columns of data. Here is what I have:
for zfiles in glob.glob("*.zip"):
    with zipfile.ZipFile(zfiles, 'r') as myS:
        myS.extractall(tempDir)
os.chdir(tempDir)
for z in glob.glob("*.zip"):
    with zipfile.ZipFile(z, 'r') as mySecondS:
        mySecondS.extractall()
Thank you!
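As a hedged starting point for the iteration step (the file names here are illustrative, and the XlsxWriter part is shown only as a comment since the exact formatting wasn't specified), the standard csv and glob modules can gather the single row from each extracted file:

```python
import csv
import glob
import os
import tempfile

# Toy temp directory with one-row, five-column CSVs, as described.
temp_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(temp_dir, f"part{i}.csv"), "w", newline="") as fh:
        csv.writer(fh).writerow([f"v{i}-{c}" for c in range(5)])

# Collect the single data row from every CSV in the temp folder.
rows = []
for path in sorted(glob.glob(os.path.join(temp_dir, "*.csv"))):
    with open(path, newline="") as fh:
        rows.append(next(csv.reader(fh)))

print(len(rows), len(rows[0]))  # 3 5

# With XlsxWriter the collected rows could then be written out, e.g.:
# workbook = xlsxwriter.Workbook("combined.xlsx")
# worksheet = workbook.add_worksheet()
# for r, row in enumerate(rows):
#     worksheet.write_row(r, 0, row)
# workbook.close()
```

Collecting all rows first keeps the CSV-reading and Excel-writing concerns separate, which makes each half easier to test.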

Renaming and Saving Excel Files With Python

I've got a pretty simple task, but I haven't done many Excel operations within Python and I'm not sure how to go about this.
What I need to do:
Look at many Excel files within subfolders, rename them according to information within each file, and store them all in one folder somewhere else.
The data is structured like this:
Main Folder
  Subfolder1
    File1
    File2
    File3
  ...
For about a hundred subfolders and several files within each subfolder.
From here, I want to pull the company name, part number, and date from within each file and use those to rename the Excel file. I'm not sure how to rename the file and then save it somewhere else. I'm having trouble finding all these functions; any advice?
Check the os and os.path modules for listing folder contents (walk, listdir) and working with path names (abspath, basename, etc.).
Also, shutil has some interesting functions for copying stuff. Check out copyfile and specify the dst parameter based on the data you read from the Excel file.
This page can help you getting at the Excel data: http://www.python-excel.org/
You probably want to have some high-level code like this:
for subfolder_name in os.listdir(MAIN_FOLDER):
    # exercise left to reader: filter out non-folders
    subfolder_path = os.path.join(MAIN_FOLDER, subfolder_name)
    for excel_file_name in os.listdir(subfolder_path):
        # exercise left to reader: filter out non-excel-files
        excel_file_path = os.path.join(subfolder_path, excel_file_name)
        new_excel_file_name = extract_filename_from_excel_file(excel_file_path)
        # everything lands in one destination folder (create it first)
        new_excel_file_path = os.path.join(NEW_MAIN_FOLDER, new_excel_file_name)
        shutil.copyfile(excel_file_path, new_excel_file_path)
You'll have to provide extract_filename_from_excel_file yourself, using the xlrd module from the site I mentioned.
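As a sketch of what that helper might return once the three values have been read from the sheet (which cells hold them, and even which library reads them, are assumptions, so the values are passed in already extracted here), the filename-building part can stand alone:

```python
import re

def build_new_filename(company, part_number, date, ext=".xls"):
    """Combine the values pulled from the sheet into a safe file name."""
    stem = f"{company}_{part_number}_{date}"
    # Strip characters that are not legal in Windows file names.
    stem = re.sub(r'[\\/:*?"<>|]', "-", stem)
    return stem + ext

print(build_new_filename("Acme Corp", "PN-001", "2012-05-01"))
# Acme Corp_PN-001_2012-05-01.xls
```

Sanitizing the extracted values matters because cell contents can contain slashes or colons that would make copyfile fail.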
