Compare two Excel files in Python

import xlrd
wb_1 = xlrd.open_workbook('Book1.xls', on_demand=True)
ws_1 = wb_1.sheet_by_name('Sheet3')
wb_2 = xlrd.open_workbook('Book2.xls', on_demand=True)
ws_2 = wb_2.sheet_by_name('Sheet3')
for i in range(ws_1.ncols):
    col_value1 = ws_1.cell_value(0, i)
    for cell in range(ws_1.nrows):
        cell_value1 = ws_1.cell(cell, i)
        for j in range(ws_2.ncols):
            col_value2 = ws_2.cell_value(0, i)
            for cell in range(ws_2.nrows):
                cell_value2 = ws_2.cell(cell, i)
                if cell_value2 == cell_value1:
                    print('same')
I'm trying to compare two Excel worksheets, and I'm not sure whether I'm going about it the right way. How can I find the changed values?

Try the code below to extract the row and column differences.
import xlrd
# note: xlrd 2.0 and later reads only .xls files, so .xlsx needs an older xlrd or another library
wb_1 = xlrd.open_workbook('Book1.xlsx', on_demand=True)
ws_1 = wb_1.sheet_by_name('Sheet3')
rw,cl,rw2,cl2=[[] for i in range(4)]
for i in range(0,ws_1.ncols):
    col_value1 = ws_1.cell(0, i).value
    cl.append(col_value1)
    for cell in range(0,ws_1.nrows):
        row_value1 = ws_1.cell(cell, i).value
        rw.append(row_value1)
wb_2 = xlrd.open_workbook('Book2.xlsx', on_demand=True)
ws_2 = wb_2.sheet_by_name('Sheet3')
for i in range(0,ws_2.ncols):
    col_value2 = ws_2.cell(0, i).value
    cl2.append(col_value2)
    for cell in range(0,ws_2.nrows):
        row_value2 = ws_2.cell(cell, i).value
        rw2.append(row_value2)
for i in range(len(cl)):
    for j in range(len(cl2)):
        if cl[i]!=cl2[j]:
            print("column difference",i,j)
for i in range(len(rw)):
    for j in range(len(rw2)):
        if rw[i]!=rw2[j]:
            print("row difference",i,j)
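Note that the loops above compare every value in one sheet against every value in the other, so they also report pairs that merely sit at different positions. If the goal is to find cells that changed in place, a position-by-position comparison may be closer to what is wanted. Here is a minimal sketch on plain lists of rows; diff_grids is a hypothetical helper (not part of xlrd), and with xlrd the two grids could presumably be built with [ws.row_values(r) for r in range(ws.nrows)].

```python
def diff_grids(grid_a, grid_b):
    """Return (row, col, old, new) for every cell that differs by position.

    Both grids are lists of rows; a cell missing on one side is
    reported as None.
    """
    changes = []
    for r in range(max(len(grid_a), len(grid_b))):
        row_a = grid_a[r] if r < len(grid_a) else []
        row_b = grid_b[r] if r < len(grid_b) else []
        for c in range(max(len(row_a), len(row_b))):
            old = row_a[c] if c < len(row_a) else None
            new = row_b[c] if c < len(row_b) else None
            if old != new:
                changes.append((r, c, old, new))
    return changes


# Made-up sample data standing in for the two worksheets
sheet_1 = [["id", "name"], [1, "alpha"], [2, "beta"]]
sheet_2 = [["id", "name"], [1, "alpha"], [2, "gamma"]]
for change in diff_grids(sheet_1, sheet_2):
    print("changed cell:", change)
```

Row and column indexes are zero-based here, matching xlrd's own indexing.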

Try converting the Excel files to CSV, which separates the values with commas. Python's standard library has a module for this called csv.
Just "import csv".
Then open each file using "with" and read the rows along with the column names; you get lists or dictionaries, depending on the approach (csv.reader or csv.DictReader).
After that you only have to compare the lists index by index, which is the easiest way.
Read this article:
https://realpython.com/python-csv/
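A rough sketch of that CSV approach, assuming both sheets have already been exported to CSV. The helper name changed_rows and the sample data are made up for illustration; in-memory text stands in for the real files.

```python
import csv
import io
from itertools import zip_longest


def changed_rows(file_a, file_b):
    """Yield (row_number, row_a, row_b) for every row that differs.

    row_number is 1-based; a row missing on one side is reported as None.
    """
    reader_a = csv.reader(file_a)
    reader_b = csv.reader(file_b)
    pairs = zip_longest(reader_a, reader_b)
    for number, (row_a, row_b) in enumerate(pairs, start=1):
        if row_a != row_b:
            yield number, row_a, row_b


# In-memory stand-ins; with real files use open(path, newline="")
old = io.StringIO("id,name\n1,alpha\n2,beta\n")
new = io.StringIO("id,name\n1,alpha\n2,gamma\n3,delta\n")
differences = list(changed_rows(old, new))
for number, row_a, row_b in differences:
    print(number, row_a, row_b)
```

Keep in mind that csv.reader returns every value as a string, so 1 and 1.0 would count as different unless the values are converted first.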

Related

Search specific text(text pattern) in excel and copy all resulting rows to another sheet in same workbook using openpyxl

I have an Excel file with multiple sheets. The 3rd column of Sheet3 (around 500 rows) contains various names. I want to search column 3 for specific text and, if it matches, copy the whole row along with the header row to a new sheet within the same workbook.
The issue with the "name" column is that many of the entries refer to the same item but follow different naming conventions, so:
अपर तहसीलदार,
नायाब तहसीलदार,
नायब तहसीलदार,
अतिरिक्त तहसीलदार
all refer to the same item but are written differently, so I have to search for all the variants.
I have no prior Python or openpyxl background, so what I've got so far is:
import openpyxl
wb = openpyxl.load_workbook(r'C:/Users/Anas/Downloads/rcmspy.xlsx')
#active worksheet data
ws = wb.active
def wordfinder(searchString):
    for i in range(1, ws.max_row + 1):
        for j in range(1, ws.max_column + 1):
            if searchString == ws.cell(i,j).value:
                print("found")
                print(ws.cell(i,j))
wordfinder("अपर तहसीलदार")
It does not raise any error, but it doesn't print anything either.
The excel sheet looks something like this:
I'm not certain, but I would suggest something along the lines of:
variants = {'alpha','alfa','elfa'}
data = []
rowCount = 0
for row in ws.values:
    # each row is a tuple of cell values
    if rowCount == 0:
        # header row
        data.append(row)
    elif row[2] in variants:
        # third column matches one of the variants
        data.append(row)
    rowCount += 1
wsNew = wb.create_sheet('Variations')
for line in data:
    wsNew.append(line)
wb.save('newWorkbook.xlsx')
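As for why the original exact comparison prints nothing, two guesses (neither can be confirmed without the file): wb.active may not be Sheet3, or the cell text may differ from the search string by stray whitespace or by Unicode composition. A minimal sketch of a more tolerant membership test follows; normalize and is_variant are hypothetical helper names, and NFC normalization is an assumption about what the data needs.

```python
import unicodedata


def normalize(text):
    """Return text in NFC form without surrounding whitespace.

    Non-string values (empty cells, numbers) become an empty string.
    """
    if not isinstance(text, str):
        return ""
    return unicodedata.normalize("NFC", text).strip()


# The spellings listed in the question
VARIANTS = {
    normalize(name)
    for name in (
        "अपर तहसीलदार",
        "नायाब तहसीलदार",
        "नायब तहसीलदार",
        "अतिरिक्त तहसीलदार",
    )
}


def is_variant(cell_value):
    """True if the cell holds one of the known spellings."""
    return normalize(cell_value) in VARIANTS


print(is_variant("  अपर तहसीलदार "))
print(is_variant(None))
```

In the loop above, the test row[2] in variants could then be swapped for is_variant(row[2]).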

Openpyxl: We found a problem with some content

I am getting the error message 'We found a problem with some content' when opening a file I generated with openpyxl. The file is generated by concatenating different xlsx files and adding additional formulas in further cells.
The problem is caused by a formula with an IF condition that I am writing into a cell (the second for loop is causing the Excel error message).
Here is the code:
import openpyxl as op
import glob
# Search for all xlsx files in directory and assign them to variable allfiles
allfiles = glob.glob('*.xlsx')
print('Following files are going to be included into the inventory: ' + str(allfiles))
# Create a workbook with a sheet called 'Input'
risk_inventory = op.load_workbook('./Report/Risikoinventar.xlsx', data_only = False)
input_sheet = risk_inventory['Input']
risk_inventory.remove(input_sheet)
input_sheet = risk_inventory.create_sheet()
input_sheet.title = 'Input'
r_maxrow = input_sheet.max_row + 1
# There is more code here which is not related to the problem
for i in range (2,r_maxrow):
    if input_sheet.cell(row = i, column = 2).value == 'Top-Down':
        input_sheet.cell(row = i, column = 20).value = '=IF(ISTEXT(H{}),0,IF(H{}<=1000000,1,IF(H{}<=2000000,2,IF(H{}<=4000000,3,IF(H{}<=8000000,4,IF(H{}>8000000,5,0))))))'.format(i,i,i,i,i,i)
    elif input_sheet.cell(row = i, column = 2).value == 'Bottom-Up':
        input_sheet.cell(row = i, column = 20).value = '=IF(ISTEXT(H{}),0,IF(H{}<=1000000,1,IF(H{}<=2000000,2,IF(H{}<=4000000,3,IF(H{}<=8000000,4,IF(H{}>8000000,5,0))))))'.format(i,i,i,i,i,i)
for i in range (2,r_maxrow):
    if input_sheet.cell(row = i, column = 2).value == 'Top-Down':
        input_sheet.cell(row = i, column = 21).value = '=IF(K{}="Sehr gering",1,IF(K{}="Gering",2,IF(K{}="Mittel",3,IF(K{}="Hoc",3,IF(K{}="Sehr hoch",3,0))))))'.format(i,i,i,i,i,i)
    elif input_sheet.cell(row = i, column = 2).value == 'Bottom-Up':
        input_sheet.cell(row = i, column = 21).value = '=IF(K{}="Sehr gering",1,IF(K{}="Gering",2,IF(K{}="Mittel",3,IF(K{}="Hoc",3,IF(K{}="Sehr hoch",3,0))))))'.format(i,i,i,i,i,i)
So depending on what information is in cell(row = i, column = 2), I want a specific formula in cell(row = i, column = 21). The first for loop works perfectly; the second for loop causes the error message in Excel, and the formulas are not written to the cells.
As you can probably see, I have been trying to code with Python for only a week and have never tried coding before…
Many thanks in advance!
I've been having the same issue, and it was due to an incorrectly written formula. I found what was wrong by clicking "View" instead of "Delete" when opening the file.
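Counting the parentheses in the second formula as quoted in the question points to one likely culprit: it opens five IF( calls but closes six. A small balance check along these lines can catch that before the file is saved; paren_balance is a hypothetical helper, not part of openpyxl, and row 2 is used only as a sample.

```python
def paren_balance(formula):
    """Return opening minus closing parentheses, ignoring quoted text."""
    balance = 0
    in_quotes = False
    for char in formula:
        if char == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if char == "(":
                balance += 1
            elif char == ")":
                balance -= 1
    return balance


# The two formula templates from the question, filled in for row 2
first = '=IF(ISTEXT(H{0}),0,IF(H{0}<=1000000,1,IF(H{0}<=2000000,2,IF(H{0}<=4000000,3,IF(H{0}<=8000000,4,IF(H{0}>8000000,5,0))))))'.format(2)
second = '=IF(K{0}="Sehr gering",1,IF(K{0}="Gering",2,IF(K{0}="Mittel",3,IF(K{0}="Hoc",3,IF(K{0}="Sehr hoch",3,0))))))'.format(2)

print(paren_balance(first))   # 0 means balanced
print(paren_balance(second))  # a negative value means too many ")"
```

A balance of zero does not prove the formula is valid, but a non-zero balance is a strong hint that Excel will reject it.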

How can I concatenate multiple rows of excel data into one?

I'm currently facing an issue where I need to bring all of the data shown in the images below into one line only.
So using Python and openpyxl, I tried to write a parsing script that reads each line and copies a value into a new workbook only when it is non-null and not identical to the previous one.
I get out-of-range errors, and the code does not keep just the data I want. I've spent multiple hours on it, so I thought I would ask here to see if I can get unstuck.
I've read some of the openpyxl documentation and some material about making lists in Python, and tried a couple of videos on YouTube, but none of them did exactly what I am trying to achieve.
import openpyxl
from openpyxl import Workbook
path = "sample.xlsx"
wb = openpyxl.load_workbook(path)
ws = wb.active
path2 = "output.xlsx"
wb2 = Workbook()
ws2 = wb2.active
listab = []
rows = ws.max_row
columns = ws.max_column
for i in range (1, rows+1):
    listab.append([])
cellValue = " "
prevCell = " "
for c in range (1, rows+1):
    for r in range(1, columns+1):
        cellValue = ws.cell(row=r, column=c).value
        if cellValue == prevCell:
            listab[r-1].append(prevCell)
        elif cellValue == "NULL":
            listab[r-1].append(prevCell)
        elif cellValue != prevCell:
            listab[r-1].append(cellValue)
        prevCell = cellValue
for r in range(1, rows+1):
    for c in range (1, columns+1):
        j = ws2.cell(row = r, column=c)
        j.value = listab[r-1][c-1]
print(listab)
wb2.save("output.xlsx")
There should be one line with the below information:
ods_service_id | service_name| service_plan_name| CPU | RAM | NIC | DRIVE |
Personally I would go with pandas.
import pandas as pd
#Loading into pandas
df_data = pd.read_excel('sample.xlsx')
df_data.fillna("NO DATA",inplace=True) ## Replaced nan values with "NO DATA"
# Column name taken from the header row shown in the question
unique_ids = df_data.ods_service_id.unique()
#Storing pd into a list
records_list = df_data.to_dict('records')
keys_to_check = ['service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']
processed = {}
#Go through unique ids
for key in unique_ids:
    processed[key] = {}
    #Get related records
    matching_records = [y for y in records_list if y['ods_service_id'] == key]
    #Loop through records
    for record in matching_records:
        #For each key to check, save in dict if non null
        processed[key]['ods_service_id'] = key
        for detail_key in keys_to_check:
            if record[detail_key] != "NO DATA" :
                processed[key][detail_key] = record[detail_key]
            ##Note : doesn't handle duplicate values for different keys so far
#Records are put back in list
output_data = [processed[x] for x in processed.keys()]
# -> to Pandas
df = pd.DataFrame(output_data)[['ods_service_id','service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']]
#Export to Excel
df.to_excel("output.xlsx",sheet_name='Sheet_name_1', index=False)
The above should work, but I wasn't sure how you want to store duplicated records for the same id. Do you want to store them as DRIVE_0, DRIVE_1, DRIVE_2?
EDIT:
df could be exported in a different way. Replace the line below "#Export to Excel" with the following:
df.to_excel("output.xlsx",sheet_name='Sheet_name_1')
EDIT 2:
With no input data it was hard to spot any flaws. I corrected the code above using fake data.
To be honest, I think you've managed to get confused by data structures and come up with something far more complicated than you need.
One approach that would suit would be to use Python dictionaries for each service, updating them row by row.
from openpyxl import Workbook, load_workbook
wb = load_workbook("sample.xlsx")
ws = wb.active
objs = {}
headers = next(ws.iter_rows(min_row=1, max_row=1, values_only=True))
for row in ws.iter_rows(min_row=2, values_only=True):
    if row[0] not in objs:
        # first row for this service: keep every column
        obj = {key:value for key, value in zip(headers, row)}
        objs[obj['ods_service_id']] = obj
    else:
        # later rows: update the stored dict with non-null values only
        extra = {key:value for key, value in zip(headers[3:], row[3:]) if value not in (None, "NULL")}
        objs[row[0]].update(extra)
# write to new workbook
wb2 = Workbook()
ws2 = wb2.active
ws2.append(headers)
for obj in objs.values(): # do they need sorting?
    ws2.append([obj[key] for key in headers])
wb2.save("output.xlsx")
Note how you can do everything without using counters.
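To see the dictionary-per-service idea in isolation, here is a small sketch that works on plain tuples instead of a workbook. merge_rows is a hypothetical helper and the sample values are invented; it assumes empty cells show up either as None or as the text "NULL".

```python
def merge_rows(headers, rows, key_index=0, null_values=(None, "NULL")):
    """Collapse rows sharing the same key into one dict per key.

    Later rows only fill in or overwrite columns where they carry a
    non-null value.
    """
    merged = {}
    for row in rows:
        # Start each new key with every column set to None
        record = merged.setdefault(row[key_index], dict.fromkeys(headers))
        for name, value in zip(headers, row):
            if value not in null_values:
                record[name] = value
    return list(merged.values())


headers = ["ods_service_id", "service_name", "CPU", "RAM"]
rows = [
    (1, "web", "4", "NULL"),
    (1, "web", "NULL", "8GB"),
]
result = merge_rows(headers, rows)
print(result)
```

Each resulting dict can then be written out with ws2.append([record[name] for name in headers]).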

How to filter column data using openpyxl

I am trying to apply a filter to an existing Excel file, and export it to another Excel file. I would like to extract rows that only contain the value 16, then export the table to another excel file (as shown in the picture below).
I have tried reading the openpyxl documentation multiple times and googling for solutions but I still can't make my code work. I have also attached the code and files below
import openpyxl
# Load both workbooks
wb1 = openpyxl.load_workbook('test_data.xlsx')
wb2 = openpyxl.load_workbook('test_data_2.xlsx')
# Get the worksheets from the workbooks
sh1 = wb1["data_set_1"]
sh2 = wb2["Sheet1"]
sh1.auto_filter.ref = "A:A"
sh1.auto_filter.add_filter_column(0, ["16"])
sh1.auto_filter.add_sort_condition("B2:D6")
sh1_row_number = sh1.max_row
sh1_col_number = sh1.max_column
rangeSelected = []
for i in range(1, sh1_row_number+1, 1):
    rowSelected = []
    for j in range(1, sh1_col_number+1, 1):
        rowSelected.append(sh1.cell(row = i, column = j))
    rangeSelected.append(rowSelected)
    del rowSelected
for i in range(1, sh1_row_number+1, 1):
    for j in range(1, sh1_col_number+1, 1):
        sh2.cell(row = i, column = j).value = rangeSelected[i-1][j-1].value
wb1.save("test_data.xlsx")
wb2.save("test_data_2.xlsx")
The picture shows the desired result.
The auto filter doesn't actually filter the data, it is just for visualization.
You probably want to filter while looping through the workbook. Please note with this code I assume you have the table headers already in the second workbook. It does not overwrite the data, it appends to the table.
import openpyxl
# Load both workbooks
wb1 = openpyxl.load_workbook('test_data.xlsx')
wb2 = openpyxl.load_workbook('test_data_2.xlsx')
# Get the worksheets from the workbooks
sh1 = wb1["data_set_1"]
sh2 = wb2["data_set_1"] # use same sheet name, different workbook
for row in sh1.iter_rows():
    if row[0].value == 16: # filter on first column with value 16
        sh2.append((cell.value for cell in row))
wb1.save("test_data.xlsx")
wb2.save("test_data_2.xlsx")
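One thing to watch for: the comparison row[0].value == 16 only matches cells stored as numbers, while the question's filter used the text "16". A hedged sketch of a filter that accepts both, written against plain tuples so it can be tried without a workbook; filter_rows is a hypothetical helper, and with openpyxl the rows could presumably come from sh1.iter_rows(values_only=True).

```python
def filter_rows(rows, column_index, wanted):
    """Keep the header (first row) plus rows whose cell equals `wanted`.

    Values are compared as stripped strings, so 16 and "16" both match.
    A float such as 16.0 would not match; convert first if needed.
    """
    rows = list(rows)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    target = str(wanted).strip()
    kept = [row for row in body if str(row[column_index]).strip() == target]
    return [header] + kept


# Made-up sample data
table = [("value", "label"), (16, "a"), ("16", "b"), (17, "c")]
selected = filter_rows(table, 0, 16)
print(selected)
```

Unlike the answer above, this keeps the header row in the result, so it suits a target sheet that starts out empty.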

Using xlwt vs openpyxl

I need some help with openpyxl in Python. I have been using xlwt quite successfully, but now I have some files (in MySQL Workbench) that contain more than 65,000 rows. I know I can create a CSV file, but XLSX is the preferred output. I am able to create a workbook using openpyxl, but I have not been successful in placing the MySQL data into the table. The main portion of the program using xlwt is pretty straightforward (see below). I just cannot figure out how to do the same thing using openpyxl. I've tried a number of different combinations and solutions, but I get stuck after the "for x in result:".
file_dest = "c:\home\test.xls"
result = dest.execute("select a, b, c, d from filea")
for x in result:
    rw = rw + 1
    sheet1 = book.add.sheet("Sheet 1")
    row1 = sheet1.row(rw)
    row1.write(1, x[0])
    row1.write(1, x[1])
    row1.write(1, x[2])
    row1.write(1, x[3])
book.save(file_dest)
This is a good use case for append(). From the documentation:
Appends a group of values at the bottom of the current sheet.
If it's a list: all values are added in order, starting from the first column
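import openpyxl
file_dest = "test.xlsx"
workbook = openpyxl.Workbook()
# workbook.active replaces get_active_sheet(), which is deprecated in later openpyxl releases
worksheet = workbook.active
result = dest.execute("select a, b, c, d from filea")
for x in result:
    # each database row becomes one worksheet row
    worksheet.append(list(x))
workbook.save(file_dest)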
A little example:
from openpyxl import Workbook
wb = Workbook()
ws = wb.worksheets[0]
row = 2
ws.title = "Report"
# current openpyxl addresses cells by coordinate with ws['A1']
ws['A1'] = "Value"
ws['B1'] = "Note"
for item in results:  # results: the rows fetched from the database
    ws['A%d' % row] = item[0]
    ws['B%d' % row] = item[1]
    row += 1
http://pythonhosted.org//openpyxl/
