Copying data from multiple excel files to specified columns using python - python

I'm trying to write a script that runs through the Excel files in a folder and copies their contents into one workbook. The aim is to copy the contents of each file into a different column, with a fixed spacing between the columns, i.e. columns A, D (A+3) and G (D+3). For my example I am running the code with 3 base datasets.
When I run the code, the final workbook contains the last Excel file copied 3 times across the specified columns, instead of the 3 different files each copied to its own column.
What I want: A B C
What I get: C C C
Code:
import os
import openpyxl
from openpyxl import Workbook, load_workbook
import string
for file in os.listdir(file_path):
    if file.endswith('.xlsx'):
        print(f'Loading file {file}...')
        wb = load_workbook(file_path+file)
        ws = wb.worksheets[0]
        wb1 = load_workbook(new_path+'data.xlsx')
        ws1 = wb1.active
        #calculate max rows and columns in source dataset
        mr = ws.max_row
        mc = ws.max_column
        m = [0,3,6]
        #copying data to new sheet
        for i in range(1,mr+1):
            for j in range(1,mc+1):
                for y in range(0,3):
                    #reading cell value from source
                    c = ws.cell(row = i, column = j)
                    #writing read value to destination
                    ws1.cell(row = i, column = j+int(m[y])).value = c.value
        wb1.save(new_path+'data.xlsx')
Thank you for your help.
Edit:
The data is all in the same format and looks like: https://ibb.co/TMStH9j
Current output: https://ibb.co/dmcbSJ1
Desired output: https://ibb.co/C1nqKJv

You need to move the creation and saving of the new workbook out of the for loop so that it is not overwritten each time a new file is processed.
You also need to count how many files you have looped over, so that you can shift the columns where each file's data is written in the new workbook. Please see below.
Edit:
To get your expected output, I also removed the innermost for loop and the m list, and instead use a single variable to space each file's data apart.
import os
import openpyxl
from openpyxl import Workbook, load_workbook
import string
# Create new workbook outside of for loop so that it is not overwritten each loop
wb1 = Workbook()
ws1 = wb1.active
# count variable so each loop increments the column where the data is posted
count = 0
# how many columns to space data apart
col_spacing = 2
for file in os.listdir(file_path):
    if file.endswith(".xlsx"):
        print(f"Loading file {file}...")
        wb = load_workbook(file_path + file)
        ws = wb.worksheets[0]
        # calculate max rows and columns in source dataset
        mr = ws.max_row
        mc = ws.max_column
        # copying data to new sheet
        for i in range(1, mr + 1):
            for j in range(1, mc + 1):
                # reading cell value from source
                c = ws.cell(row=i, column=j)
                # writing read value to destination
                ws1.cell(row=i, column=count + j + (count * col_spacing)).value = c.value
        # increment column count
        count += 1
# save new workbook after all files have been looped through
wb1.save(new_path + "data.xlsx")
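The formula above starts each file three columns after the previous one, so source sheets wider than three columns would overwrite each other. A minimal sketch of a variant that advances the start column by each sheet's actual width plus a gap. The function name `copy_side_by_side` and its `gap` parameter are hypothetical names chosen for illustration, and the source sheets are built in memory here instead of being loaded from `file_path`:

```python
from openpyxl import Workbook


def copy_side_by_side(source_sheets, dest_sheet, gap=2):
    """Copy each source sheet into dest_sheet, left to right, `gap` empty columns apart."""
    next_col = 1  # first free column in the destination (1 = column A)
    for ws in source_sheets:
        for row in ws.iter_rows():
            for cell in row:
                target = dest_sheet.cell(row=cell.row, column=next_col + cell.column - 1)
                target.value = cell.value
        # move past this sheet's columns, then leave `gap` blank columns
        next_col += ws.max_column + gap


# three in-memory stand-ins for the source files, one column each (assumption)
sources = []
for label in ("A", "B", "C"):
    wb = Workbook()
    ws = wb.active
    for r in range(1, 4):
        ws.cell(row=r, column=1, value=f"{label}{r}")
    sources.append(ws)

dest_wb = Workbook()
dest = dest_wb.active
copy_side_by_side(sources, dest, gap=2)

# columns 1, 4 and 7 are A, D and G
first_row = [dest.cell(row=1, column=c).value for c in (1, 4, 7)]
print(first_row)  # ['A1', 'B1', 'C1']
```

With single-column sources and `gap=2` this reproduces the A, D, G layout from the question; wider sources simply push the following file further right.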

Related

Read a worksheet and write the content into a new one - Openpyxl

Dear community,
I was struggling with a piece of Python code that reads data from an Excel worksheet and then creates a new sheet with that data.
It's not just a copy of the file, because it lets you do something with the data along the way before saving it to a new file.
I was reading a file, storing the data in an intermediate list, and then trying to save it to the new xls file.
It didn't work because the data types weren't compatible with each other, and I got stuck.
I saw the code below from Python Engineering by Michael Zippo, which helped me.
# importing openpyxl module
import openpyxl as xl
# opening the source excel file
filename = "C:\\Users\\Admin\\Desktop\\trading.xlsx"
wb1 = xl.load_workbook(filename)
ws1 = wb1.worksheets[0]
# opening the destination excel file
filename1 = "C:\\Users\\Admin\\Desktop\\test.xlsx"
wb2 = xl.load_workbook(filename1)
ws2 = wb2.active
# calculate total number of rows and
# columns in source excel file
mr = ws1.max_row
mc = ws1.max_column
# copying the cell values from source
# excel file to destination excel file
for i in range(1, mr + 1):
    for j in range(1, mc + 1):
        # reading cell value from source excel file
        c = ws1.cell(row=i, column=j)
        # writing the read value to destination excel file
        ws2.cell(row=i, column=j).value = c.value
# saving the destination excel file
wb2.save(str(filename1))
After reading more of Michael Zippo's article (https://python.engineering/python-how-to-copy-data-from-one-excel-sheet-to-another/),
I found a way to improve the read-write for loop above:
from openpyxl import Workbook, load_workbook
wb1 = load_workbook('bank_statement.xlsx')
wb2 = Workbook()
sh1 = wb1.active
sh2 = wb2.active
for r in sh1.iter_rows():
    for c in r:
        sh2[c.coordinate] = c.value
wb2.save('bank_stat_improved.xlsx')
In the middle of the loop you can process the data before writing it, which makes this a very useful piece of code.
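As a rough illustration of processing values inside that loop, the sketch below passes every value through a function before writing it. The names `copy_with_transform` and `clean` are hypothetical, the whitespace trimming is only an example of a processing step, and the workbooks are created in memory instead of being loaded from `bank_statement.xlsx`:

```python
from openpyxl import Workbook


def copy_with_transform(src, dst, transform):
    """Copy every cell of src to the same coordinate in dst, passing values through transform."""
    for row in src.iter_rows():
        for cell in row:
            dst[cell.coordinate] = transform(cell.value)


def clean(value):
    # example processing step: trim whitespace around text, leave other types alone
    return value.strip() if isinstance(value, str) else value


# in-memory stand-in for the source workbook (assumption)
src_wb = Workbook()
src = src_wb.active
src.append(["  date ", " amount"])
src.append(["2024-01-05", 120.5])

dst_wb = Workbook()
dst = dst_wb.active
copy_with_transform(src, dst, clean)

copied = [[c.value for c in row] for row in dst.iter_rows()]
print(copied)  # [['date', 'amount'], ['2024-01-05', 120.5]]
```

Keeping the transformation in its own function means the copy loop stays unchanged when the processing step changes.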

add specific rows and column into another excel file using openpyxl in python

I'm trying to merge multiple files into one Excel file using openpyxl in Python.
I know there is a way to do this with pandas, but my files have a problem: there are always 2 empty rows at the beginning of each Excel file.
So to avoid that I'm using openpyxl the old way:
just open all the files and copy the specific rows and columns to a new one.
As a first step, I worked out how to copy the specific rows and columns to the new xlsx file,
but I didn't find a way to add the next file (only the values, not the header) under the first one.
This is my code.
So far it just copies the first file (the header and the values).
import openpyxl as xl
from openpyxl import Workbook
import os
def find_xlsx_files():
    # the current path
    dir_path = os.path.dirname(os.path.abspath(__file__))
    # list to store files
    res = []
    # Iterate directory
    for file in os.listdir(dir_path):
        # check only xlsx files
        if file.endswith('.xlsx'):
            res.append(file)
    return res
wb1 = xl.load_workbook(find_xlsx_files()[0])
ws1 = wb1.worksheets[0]
# open target Excel file
wb2 = Workbook()
ws = wb2.active
ws.title = "Changed Sheet"
wb2.save(filename='sample_book.xlsx')
ws2 = wb2.active
# calculate the total rows and
# columns in the Excel source file
mr = ws1.max_row
mc = ws1.max_column
# copy cell values from source
# Excel file to target Excel file
for i in range(3, mr + 1):
    for j in range(2, mc + 1):
        # read cell value from Excel source file
        c = ws1.cell(row=i, column=j)
        # writing the read value to the target Excel file
        ws2.cell(row=i, column=j).value = c.value
# save target Excel file
wb2.save(str('sample_book.xlsx'))
You are creating a list of the Excel files in the script's directory and then opening only the first file ([0]) in that list with the line:
wb1 = xl.load_workbook(find_xlsx_files()[0])
This never attempts to access any other Excel file in the list. Generating the list inside the load_workbook call isn't good either: you don't want to rebuild the list of available Excel files each time you process a file. The function find_xlsx_files() should be called once.
The easiest fix to your code is to get your list of Excel files and then iterate over that list for processing.
excel_files = find_xlsx_files()
for xl_file in excel_files:
    wb1 = xl.load_workbook(xl_file)
    ...
Also, it should not be necessary to save the workbook until you have finished writing all the data.
The function can be simplified using glob instead, if you prefer.
import glob
import os
from openpyxl import Workbook, load_workbook
dir_path = os.path.dirname(os.path.abspath(__file__))
# "[!~]*" skips the "~$..." lock files Excel creates for open workbooks
excel_files = glob.glob(dir_path + "/[!~]*.xlsx")
# open target Excel file once, before the loop, so it is not recreated per file
wb2 = Workbook()
ws2 = wb2.active
ws2.title = "Changed Sheet"
# number of rows already copied into the target sheet
rows_written = 0
for xl_file in excel_files:
    wb1 = load_workbook(xl_file)
    ws1 = wb1.worksheets[0]
    # calculate the total rows and
    # columns in the Excel source file
    mr = ws1.max_row
    mc = ws1.max_column
    # copy cell values from source
    # Excel file to target Excel file
    for i in range(3, mr + 1):
        for j in range(2, mc + 1):
            # read cell value from Excel source file
            c = ws1.cell(row=i, column=j)
            # write the value below the rows copied from earlier files
            ws2.cell(row=rows_written + i, column=j).value = c.value
    # rows 3..mr were copied from this file (its header row is included)
    rows_written += max(mr - 2, 0)
# save target Excel file once, after all files have been processed
wb2.save('sample_book.xlsx')
This also assumes there is only one sheet to process in each Excel file, since only the first sheet is opened:
ws1 = wb1.worksheets[0]
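For the original goal of placing each following file's values, without its header, under the first file, one possible approach is to append rows instead of addressing cells by position. This is only a sketch under assumptions taken from the question: every sheet has two empty rows, a header in row 3, and data starting in column B. The names `stack_sheets` and `make_source` are hypothetical, the sheets are built in memory, and the output starts in column A rather than column B:

```python
from openpyxl import Workbook


def stack_sheets(source_sheets, dest_sheet, header_row=3, first_col=2):
    """Append rows from each sheet to dest_sheet, keeping only the first sheet's header."""
    for index, ws in enumerate(source_sheets):
        # the first sheet contributes its header row; later sheets start one row below it
        start = header_row if index == 0 else header_row + 1
        for values in ws.iter_rows(min_row=start, min_col=first_col, values_only=True):
            dest_sheet.append(values)


def make_source(rows):
    # in-memory stand-in for a source file: two empty rows and an unused column A
    wb = Workbook()
    ws = wb.active
    for r, row in enumerate(rows, start=3):
        for c, value in enumerate(row, start=2):
            ws.cell(row=r, column=c, value=value)
    return ws


first = make_source([["name", "qty"], ["apple", 3]])
second = make_source([["name", "qty"], ["pear", 5], ["plum", 7]])

merged_wb = Workbook()
merged = merged_wb.active
stack_sheets([first, second], merged)

merged_rows = [list(row) for row in merged.iter_rows(values_only=True)]
print(merged_rows)  # [['name', 'qty'], ['apple', 3], ['pear', 5], ['plum', 7]]
```

Because `append` always writes to the next free row, no row counter has to be carried between files.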

Merging two separate tabs of excel sheets

I have two separate excel sheets (xlsx format),
Excel 1 - Has 2 separate tabs.
Tab 1 has summary information linked to Tab 2 and
Tab 2 is the data to be taken from Excel 2
Excel 2 - Has relevant info (which needs to be copied to tab 2 of excel 1)
Sample of 2 files are shared in the below link
https://drive.google.com/drive/folders/1inrofeT6v9P0ISEcmbswvpxMMCq5TaV0?usp=sharing
Name references of both the files are the same. Basically, I want to copy the information from Excel 2 and paste it to Excel 1 (Which has a summary sheet to provide summary information)
I tried the below code
# importing openpyxl module
import openpyxl as xl
# opening the source excel file
filename = "D:\\1. Python Extracts\\KA-AVRB-Feb22-4.xlsx"
wb1 = xl.load_workbook(filename)
ws1 = wb1.worksheets[0]
# opening the destination excel file
filename1 = "D:\\2. Summary shees\\KA-AVRB-Feb22-4.xlsx"
wb2 = xl.load_workbook(filename1)
ws2 = wb2.worksheets[1]
# calculate total number of rows and columns in source excel file
mr = ws1.max_row
mc = ws1.max_column
# copying the cell values from source excel file to destination excel file
for i in range(1, mr + 1):
    for j in range(1, mc + 1):
        # reading cell value from source excel file
        c = ws1.cell(row=i, column=j)
        # writing the read value to destination excel file
        ws2.cell(row=i+1, column=j).value = c.value
# saving the destination excel file
wb2.save(str(filename1))
The above code works with individual files. However, I have 2 sets of 140 excel files (i.e 140 excel summary sheets and 140 excel sheets having data), where I need to copy data from one file and paste it to another as explained above.
I understand I can try to place a for loop for the same, but after much trial, I'm unable to achieve the same.
Help would be highly appreciated!
Keeping source files in a subfolder named sourceFiles, and summaries in a subfolder named summary, we can iterate over all source files and run your function over them to make the summaries.
import os
# importing openpyxl module
import openpyxl as xl
# define the function before the loop that calls it
def makeSummary(source_path):
    # opening the source excel file
    #source_path ="D:\\1. Python Extracts\\KA-AVRB-Feb22-4.xlsx"
    wb1 = xl.load_workbook(source_path)
    ws1 = wb1.worksheets[0]
    # opening the destination excel file: same file name, inside the summary subfolder
    filename1 = os.path.join("summary", os.path.basename(source_path))
    wb2 = xl.load_workbook(filename1)
    ws2 = wb2.worksheets[1]
    # calculate total number of rows and columns in source excel file
    mr = ws1.max_row
    mc = ws1.max_column
    # copying the cell values from source excel file to destination excel file
    for i in range(1, mr + 1):
        for j in range(1, mc + 1):
            # reading cell value from source excel file
            c = ws1.cell(row=i, column=j)
            # writing the read value to destination excel file
            ws2.cell(row=i+1, column=j).value = c.value
    # saving the destination excel file
    wb2.save(filename1)
# os.walk yields (dirpath, dirnames, filenames) for each folder it visits
for dirpath, _, filenames in os.walk("sourceFiles", topdown=False):
    for name in filenames:
        if name.endswith(".xlsx"):
            makeSummary(os.path.join(dirpath, name))
PS: I haven't run this code on your files, and I'm unsure about the path separators since I last used this ages ago, so please debug the paths if the iteration doesn't work. To see how Python's os.walk() works, refer to its documentation.
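Since the two sets of files share the same names, it may help to pair them up first and skip any source file without a matching summary, which also sidesteps the separator question by letting `pathlib` build the paths. A minimal sketch, where `pair_files` is a hypothetical helper and the demo uses empty placeholder files in a temporary folder instead of real workbooks:

```python
import tempfile
from pathlib import Path


def pair_files(source_dir, summary_dir, pattern="*.xlsx"):
    """Return (source, summary) path pairs for files present under the same name in both folders."""
    pairs = []
    for source in sorted(Path(source_dir).glob(pattern)):
        summary = Path(summary_dir) / source.name
        if summary.exists():
            pairs.append((source, summary))
    return pairs


# demo with empty placeholder files (assumption: real folders would hold the 140 workbooks)
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "sourceFiles"
    sum_dir = Path(tmp) / "summary"
    src_dir.mkdir()
    sum_dir.mkdir()
    for name in ("a.xlsx", "b.xlsx"):
        (src_dir / name).touch()
    (sum_dir / "a.xlsx").touch()
    paired_names = [(s.name, d.name) for s, d in pair_files(src_dir, sum_dir)]

print(paired_names)  # [('a.xlsx', 'a.xlsx')]
```

Each pair can then be handed to the copying code, with the first path as the source workbook and the second as the destination.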

How to write the data that has been read into a certain column using openpyxl

I cannot figure out how to write the data that has been read below to a certain column, such as column F. Column B has been read and I want to paste it in the same workbook into column F. Eventually I would create a function because I would be doing the reading and writing columns multiple times.
import openpyxl
import os
# Finds current directory
current_path = os.getcwd()
print(current_path)
# Changes directory
os.chdir('C:\\Users\\Shane\\Documents\\Exel Example')
# prints new current directory
new_path = os.getcwd()
print(new_path)
# load workbook
wb = openpyxl.load_workbook('example.xlsx')
type(wb)
# load worksheet
ws1 = wb.active
# read sheet names
sht_names = wb.sheetnames
print(sht_names)
# ***reads and prints column B***
col_b = list(ws1.columns)[1]
for cellObj in col_b:
    print(cellObj.value)
# write column b's contents into column F
from openpyxl import load_workbook
t = load_workbook("test.xlsx")
s = t.active # get the active worksheet
lst = list(s.columns)
lst_row = list(s.rows)
cellA1 = s[1][0]
rowA = s[1]
g = (x.value for x in lst[1]) # column b?
for item in g:
    print(item)
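One way to write the values that were read from column B into column F of the same sheet is to walk the rows and assign each value to the target cell. The sketch below assumes a plain value copy (no styles or formula adjustment) and uses an in-memory workbook instead of `example.xlsx`; `copy_column` is a hypothetical helper name:

```python
from openpyxl import Workbook
from openpyxl.utils import column_index_from_string


def copy_column(ws, src_letter, dst_letter):
    """Copy every value in column src_letter to the same row of column dst_letter."""
    src = column_index_from_string(src_letter)
    dst = column_index_from_string(dst_letter)
    # the row count is read once, before any cells are written
    for row in range(1, ws.max_row + 1):
        ws.cell(row=row, column=dst).value = ws.cell(row=row, column=src).value


# in-memory stand-in for the worksheet (assumption)
wb = Workbook()
ws = wb.active
for row in (["id", "name"], [1, "ann"], [2, "bob"]):
    ws.append(row)

copy_column(ws, "B", "F")

column_f = [ws.cell(row=r, column=6).value for r in range(1, 4)]
print(column_f)  # ['name', 'ann', 'bob']
```

Because the column letters are parameters, the same function can be reused for each read-and-write pair; the workbook still has to be saved afterwards with `wb.save(...)`.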

How to filter column data using openpyxl

I am trying to apply a filter to an existing Excel file, and export it to another Excel file. I would like to extract rows that only contain the value 16, then export the table to another excel file (as shown in the picture below).
I have tried reading the openpyxl documentation multiple times and googling for solutions but I still can't make my code work. I have also attached the code and files below
import openpyxl
# Load both Excel workbooks
wb1 = openpyxl.load_workbook('test_data.xlsx')
wb2 = openpyxl.load_workbook('test_data_2.xlsx')
# Get references to the worksheets in each workbook
sh1 = wb1["data_set_1"]
sh2 = wb2["Sheet1"]
sh1.auto_filter.ref = "A:A"
sh1.auto_filter.add_filter_column(0, ["16"])
sh1.auto_filter.add_sort_condition("B2:D6")
sh1_row_number = sh1.max_row
sh1_col_number = sh1.max_column
rangeSelected = []
for i in range(1, sh1_row_number+1, 1):
    rowSelected = []
    for j in range(1, sh1_col_number+1, 1):
        rowSelected.append(sh1.cell(row = i, column = j))
    rangeSelected.append(rowSelected)
    del rowSelected
for i in range(1, sh1_row_number+1, 1):
    for j in range(1, sh1_col_number+1, 1):
        sh2.cell(row = i, column = j).value = rangeSelected[i-1][j-1].value
wb1.save("test_data.xlsx")
wb2.save("test_data_2.xlsx")
The picture shows what the desired result should look like.
The auto filter doesn't actually filter the data, it is just for visualization.
You probably want to filter while looping through the workbook. Please note with this code I assume you have the table headers already in the second workbook. It does not overwrite the data, it appends to the table.
import openpyxl
# Load both Excel workbooks
wb1 = openpyxl.load_workbook('test_data.xlsx')
wb2 = openpyxl.load_workbook('test_data_2.xlsx')
# Get references to the worksheets in each workbook
sh1 = wb1["data_set_1"]
sh2 = wb2["data_set_1"] # use same sheet name, different workbook
for row in sh1.iter_rows():
    if row[0].value == 16: # filter on first column with value 16
        sh2.append((cell.value for cell in row))
wb1.save("test_data.xlsx")
wb2.save("test_data_2.xlsx")
