I have two Excel files with different numbers of rows and columns. I need to compare the amounts in both sheets based on the unique IDs and, if any value has changed, fetch those rows and write them to a new Excel file. If there is a new entry in the second file, that row also needs to be copied into the new file. I tried the following approach, but it is not working: it raises TypeError: 'Book' object is not subscriptable for the and condition in the if statement. If I only iterate over the rows without checking for matching IDs, it reports rows as missing in the results.
from itertools import zip_longest
import xlrd
rb1 = xlrd.open_workbook('./first_file1.xlsx')
rb2 = xlrd.open_workbook('./other_file1.xlsx')
sheet1 = rb1.sheet_by_index(0)
sheet2 = rb2.sheet_by_index(0)
for rownum in range(max(sheet1.nrows, sheet2.nrows)):
    if (rownum < sheet2.nrows) and (rb1[0] == rb2[0]):
        row_rb1 = sheet1.row_values(rownum)
        row_rb2 = sheet2.row_values(rownum)
        for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
            if c1 != c2:
                print ("Cell {}{} {} != {}".format(rownum+1, xlrd.formula.colname(colnum), c1, c2))
    else:
        print ("Row {} missing".format(rownum+1))
You can try this:
import xlrd
rb1 = xlrd.open_workbook('./first_file1.xlsx')
rb2 = xlrd.open_workbook('./other_file1.xlsx')
sheet1 = rb1.sheet_by_index(0)
sheet2 = rb2.sheet_by_index(0)
new_df = []
for i, rownum_sheet2 in enumerate(range(sheet2.nrows)): #go through the (possibly longer) sheet2
    row_rb2 = sheet2.row_values(rownum_sheet2)
    for rownum_sheet1 in range(sheet1.nrows): #go through sheet1 and check for same id
        row_rb1 = sheet1.row_values(rownum_sheet1)
        if row_rb1[0] == row_rb2[0]: #if the row with the same id is not equal: append to new df
            if row_rb1 != row_rb2:
                new_df.append(row_rb2)
    if i >= sheet1.nrows: #if there are extra rows, append to new df
        new_df.append(row_rb2)
#write new df to new excel-file
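A minimal sketch of the same comparison using a dictionary lookup instead of a nested loop, assuming the unique id sits in the first column of every row. The function name diff_rows and the sample rows are hypothetical; in practice the two lists would be built from sheet.row_values(...) as above. Unlike the positional check i >= sheet1.nrows, the lookup also catches new ids that appear in the middle of the second sheet.

```python
def diff_rows(rows1, rows2):
    """Return rows of rows2 that are new or differ from the row with the same id in rows1."""
    by_id = {row[0]: row for row in rows1}  # id (first column) -> row from the first sheet
    result = []
    for row in rows2:
        old = by_id.get(row[0])
        if old is None or old != row:  # new id, or same id with changed values
            result.append(row)
    return result


# Hypothetical sample data standing in for the row values of sheet1 and sheet2
rows1 = [["A1", 100.0], ["A2", 250.0], ["A3", 75.0]]
rows2 = [["A1", 100.0], ["A4", 10.0], ["A2", 300.0], ["A3", 75.0]]
changed_or_new = diff_rows(rows1, rows2)
print(changed_or_new)
```

If the two files do not share the same columns, comparing whole rows would flag everything; in that case it may be better to compare only the amount cells.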
New Code:
df1_1 = pd.read_table('.../first_file.txt', sep = '/t')
df1_1.to_excel('filename1.xlsx')
df_first_file = pd.concat([df1_1['Column'].str.split(' ',expand=True)],axis=1)
df_new1 = df_first_file.to_excel('first_file1.xlsx')
df1 = pd.read_table('.../otherfile.txt', sep = '/t')
df1.to_excel('filename2.xlsx')
df_otherfile = pd.concat([df1['Column'].str.split(' ',expand=True)],axis=1)
df_new2 = df_otherfile.to_excel('other_file1.xlsx')
new_df = []
for i, rownum_sheet2 in enumerate(range(df_new2.nrows)): #go through the (possible longer) sheet2
    row_rb2 = df_new2.row_values(rownum_sheet2)
    for rownum_sheet1 in range(df_new1.nrows): #go through sheet1 and check for same id
        row_rb1 = df_new1.row_values(rownum_sheet1)
        if row_rb1[0] == row_rb2[0]: #if the row with the same id is not equal: append to new df
            if row_rb1 != row_rb2:
                new_df.append(row_rb2)
    if i >= df_new1.nrows: #if there are extra rows, append to new df
        new_df2 = new_df.append(row_rb2)
print (new_df2)
new_df.to_excel('final_filename.xlsx')
Is it possible to write a Python script that automatically subtracts the cell values of two worksheets in one Excel file?
I have checked some documentation, and it seems that pandas or openpyxl can be used for this, but I can't get it to work. Do you have any suggestions? Many thanks.
Script:
from datetime import datetime
import pandas as pd
import openpyxl as xl
currDateTime = datetime.now()
Sheet1 ="C:\\Users\\peter\\Downloads\\" + currDateTime.strftime('%Y%m%d') + "\\5250A" + "\\5250A.xlsx"
wb3 = xl.load_workbook(Sheet1)
ws3 = wb3.worksheets[0]
wb4 = xl.load_workbook(Sheet1)
ws4 = wb4.worksheets[1]
wb5 = xl.load_workbook(Sheet1)
ws5 = wb5.create_sheet("Done")
wb4.subtract(wb3)
wb5.save(str(Sheet1))
Expected Result:
Doing this in Excel itself could be much easier, I think, and there may be a smarter way to write this code.
[NOTE] I just do the subtraction cell by cell, so any mismatch (the same row with a different dept.id, or the same column with a different item) will cause errors. If you run into that situation, you'll have to change the following code accordingly.
import openpyxl as xl
def get_row_values(worksheet):
    """
    return data structure:
    [
        [A1, B1, C1, ...],
        [A2, B2, C2, ...],
        ...
    ]
    """
    result = []
    for i in worksheet.rows:
        row_data = []
        for j in i:
            row_data.append(j.value)
        result.append(row_data)
    return result
if __name__ == '__main__':
    # load excel file
    wb = xl.load_workbook('test1.xlsx')
    ws1 = wb.worksheets[0]
    ws2 = wb.worksheets[1]
    # get data from the first 2 worksheets
    ws1_rows = get_row_values(ws1)
    ws2_rows = get_row_values(ws2)
    # calculate and make a new sheet
    ws_new = wb.create_sheet('Done')
    # insert header
    ws_new.append(ws1_rows[0])
    for row in range(1, len(ws1_rows)):
        # do the subtraction cell by cell
        row_data = []
        for column, value in enumerate(ws1_rows[row]):
            if column == 0:
                # insert first column
                row_data.append(value)
            else:
                if ws1_rows[row][0] == ws2_rows[row][0]:
                    # process only when first column match
                    row_data.append(value - ws2_rows[row][column])
        ws_new.append(row_data)
    wb.save('test2.xlsx')
Here's my sample Excel file: first sheet, second sheet, and generated sheet (images not included).
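The mismatch case mentioned in the note above can be handled by matching rows on the first column rather than on position. A minimal sketch, assuming the first column is a unique key and all other cells are numeric; subtract_by_key and the sample rows are hypothetical, and the input lists have the same shape as the output of get_row_values (header row first).

```python
def subtract_by_key(rows1, rows2):
    """Subtract rows2 from rows1 cell by cell, pairing rows by their first column."""
    header, body1 = rows1[0], rows1[1:]
    lookup = {row[0]: row for row in rows2[1:]}  # key -> row from the second sheet
    result = [header]
    for row in body1:
        other = lookup.get(row[0])
        if other is None:
            continue  # key missing from the second sheet: skipped here (an assumption)
        result.append([row[0]] + [a - b for a, b in zip(row[1:], other[1:])])
    return result


# Hypothetical sample data: same keys, different row order
sheet1 = [["dept", "x", "y"], ["d1", 10, 5], ["d2", 7, 3]]
sheet2 = [["dept", "x", "y"], ["d2", 2, 1], ["d1", 4, 5]]
done = subtract_by_key(sheet1, sheet2)
print(done)
```

Each list in the result can then be passed to ws_new.append(...) in the same way as in the code above.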
I'm currently facing an issue where I need to bring all of the data shown in the images below into one line only.
So using Python and Openpyxl, I tried to write a parsing script that reads the line and only copies when values are non-null or non-identical, into a new workbook.
I get out of range errors, and the code does not keep just the data I want. I've spent multiple hours on it, so I thought I would ask here to see if I can get unstuck.
I've read some documentation on Openpyxl and about making lists in python, tried a couple of videos on youtube, but none of them did exactly what I was trying to achieve.
import openpyxl
from openpyxl import Workbook
path = "sample.xlsx"
wb = openpyxl.load_workbook(path)
ws = wb.active
path2 = "output.xlsx"
wb2 = Workbook()
ws2 = wb2.active
listab = []
rows = ws.max_row
columns = ws.max_column
for i in range (1, rows+1):
    listab.append([])
cellValue = " "
prevCell = " "
for c in range (1, rows+1):
    for r in range(1, columns+1):
        cellValue = ws.cell(row=r, column=c).value
        if cellValue == prevCell:
            listab[r-1].append(prevCell)
        elif cellValue == "NULL":
            listab[r-1].append(prevCell)
        elif cellValue != prevCell:
            listab[r-1].append(cellValue)
            prevCell = cellValue
for r in range(1, rows+1):
    for c in range (1, columns+1):
        j = ws2.cell(row = r, column=c)
        j.value = listab[r-1][c-1]
print(listab)
wb2.save("output.xlsx")
There should be one line with the below information:
ods_service_id | service_name| service_plan_name| CPU | RAM | NIC | DRIVE |
Personally I would go with pandas.
import pandas as pd
#Loading into pandas
df_data = pd.read_excel('sample.xlsx')
df_data.fillna("NO DATA",inplace=True) ## Replaced nan values with "NO DATA"
unique_ids = df_data.ods_service_ids.unique()
#Storing pd into a list
records_list = df_data.to_dict('records')
keys_to_check = ['service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']
processed = {}
#Go through unique ids
for key in unique_ids:
    processed[key] = {}
    #Get related records
    matching_records = [y for y in records_list if y['ods_service_ids'] == key]
    #Loop through records
    for record in matching_records:
        #For each key to check, save in dict if non null
        processed[key]['ods_service_ids'] = key
        for detail_key in keys_to_check:
            if record[detail_key] != "NO DATA" :
                processed[key][detail_key] = record[detail_key]
        ##Note : doesn't handle duplicate values for different keys so far
#Records are put back in list
output_data = [processed[x] for x in processed.keys()]
# -> to Pandas
df = pd.DataFrame(output_data)[['ods_service_ids','service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']]
#Export to Excel
df.to_excel("output.xlsx",sheet_name='Sheet_name_1', index=False)
The above should work, but I wasn't sure how you wanted to save duplicated records for the same id. Do you want to store them as DRIVE_0, DRIVE_1, DRIVE_2?
EDIT:
df could be exported in a different way. Replace the line below #Export to Excel with the following:
df.to_excel("output.xlsx",sheet_name='Sheet_name_1')
EDIT 2:
With no input data it was hard to see any flaws. Corrected the code above using fake data.
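For the duplicate-value question, one option is to collect every distinct non-null value per column instead of overwriting. A minimal sketch in plain Python, assuming records shaped like the output of to_dict('records'); collect_values and the sample records are hypothetical.

```python
def collect_values(records, id_key, keys_to_check, null="NO DATA"):
    """Group records by id and keep each distinct non-null value per column, in order seen."""
    processed = {}
    for record in records:
        entry = processed.setdefault(record[id_key], {key: [] for key in keys_to_check})
        for key in keys_to_check:
            value = record[key]
            if value != null and value not in entry[key]:
                entry[key].append(value)
    return processed


# Hypothetical records: one service whose two drives are spread over two rows
records = [
    {"ods_service_ids": 1, "CPU": "2", "DRIVE": "50GB"},
    {"ods_service_ids": 1, "CPU": "NO DATA", "DRIVE": "100GB"},
]
merged = collect_values(records, "ods_service_ids", ["CPU", "DRIVE"])
print(merged)
```

The lists could then be expanded into DRIVE_0, DRIVE_1, ... columns or joined into a single cell, depending on the layout you want.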
To be honest, I think you've managed to get confused by data structures and come up with something far more complicated than you need.
One approach that would suit would be to use Python dictionaries for each service, updating them row by row.
from openpyxl import Workbook, load_workbook
wb = load_workbook("sample.xlsx")
ws = wb.active
objs = {}
headers = next(ws.iter_rows(min_row=1, max_row=1, values_only=True))
for row in ws.iter_rows(min_row=2, values_only=True):
    if row[0] not in objs:
        obj = {key:value for key, value in zip(headers, row)}
        objs[obj['ods_service_id']] = obj
    else:# update the dict for this id with non-empty values
        extra = {key:value for key, value in zip(headers[3:], row[3:]) if value not in (None, "NULL")}
        objs[row[0]].update(extra)
# write to new workbook
wb2 = Workbook()
ws2 = wb2.active
ws2.append(headers)
for obj in objs.values(): # do they need sorting?
    ws2.append([obj[key] for key in headers])
wb2.save("output.xlsx")
Note how you can do everything without using counters.
My Python is rudimentary. What I want to do is take the first dataframe, search for a unique number, and create a new df in the same formatted template; then use the same unique number to search through the second df and create a new df pertinent to that number in the specified format; then stack all the looped data on top of each other.
This is the code
#function
def multiple_dfs(df_list, sheets, file_name, spaces):
    writer = pd.ExcelWriter(file_name,engine='xlsxwriter')
    row = 0
    for i in uniqueIR:
        dftopi = df_out[df_out['Invoice Reference Number'] == i]
        df2 = df_out_fin[df_out_fin['Invoice Reference Number'] == i]
        df3 = df2.drop(columns = ['Invoice Reference Number'])
        for dataframe in df_list:
            dataframe.to_excel(writer,sheet_name=sheets,startrow=row , startcol=0, index = False)
            row = row + len(dataframe.index) + spaces
    writer.save()
# list of dataframes
dfs = [dftopi,df3]
# run function
multiple_dfs(dfs, 'Validation', 'test1.xlsx', 1)
This is what I want:
table output
Figured out a solution, in case anyone in the future is wondering:
import pandas as pd
writer = pd.ExcelWriter('test3.xlsx', engine = 'xlsxwriter')
dflist = []
for i in uniqueIR:
    dftopi = df_out[df_out['Invoice Reference Number'] == i]
    df2 = df_out_fin[df_out_fin['Invoice Reference Number'] == i]
    df3 = df2.drop(columns = ['Invoice Reference Number'])
    dftopi.to_excel(writer, sheet_name = 'Top Half' + str(i), index = False)
    df3.to_excel(writer, sheet_name = 'Bottom Half' + str(i), index = False)
    dflist.append(dftopi)
    dflist.append(df3)
writer.close() # writer.save() was removed in pandas 2.0
def multiple_dfs(df_list, sheets, file_name, spaces):
    writer = pd.ExcelWriter(file_name,engine='xlsxwriter')
    row = 0
    for dataframe in df_list:
        dataframe.to_excel(writer,sheet_name=sheets,startrow=row , startcol=0, index = False)
        row = row + len(dataframe.index) + spaces
    writer.close()
multiple_dfs(dflist, 'Validation', 'test4.xlsx', 1)
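One detail of the row arithmetic above: with the default header=True and single-level column headers, to_excel writes a header row before the data, so each frame occupies len(dataframe.index) + 1 rows. A minimal sketch of the offset calculation in plain Python; start_rows is a hypothetical helper, and the lengths stand in for len(dataframe.index).

```python
def start_rows(lengths, spaces, header=True):
    """Return the startrow for each table so that `spaces` blank rows separate them."""
    rows = []
    row = 0
    for n in lengths:
        rows.append(row)
        row += n + (1 if header else 0) + spaces  # data rows + header row + gap
    return rows


offsets = start_rows([3, 2, 4], spaces=1)
print(offsets)
```

With the original formula and spaces=1, consecutive tables sit directly under each other with no blank row in between; counting the header row restores the gap.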
I have been struggling with this code all day. During each run of the loop, a table is read from a different MS Word file. The table is copied to a dataframe and then it is copied to a row in an Excel file.
With each subsequent run of the for-loop, the Excel row is incremented so the new dataframe can be written to a new row, but after the script runs only one row contains a dataframe.
When I print(tfile), I get the following .. ('CIV-ASCS-016_TRS.docx', 'CIV-ASCS-018_TRS .docx', 'CIV-ASCS-020_TRS.docx', 'CIV-ASCS-021_TRS .docx') This proves that loop ran 4 times based on 4 files in the directory. I set the initial row pos to 0 outside of the for-loop.
Note: I am not showing any lines of code with regards to importing the necessary libraries.
files = glob('*.docx')
pos = 1
for i, wfile in enumerate(files[:1]):
    document = Document(wfile)
    table = document.tables[0]
    data = []
    keys = {}
    for j, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if j == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    tfile = tuple(files)
    df = pd.DataFrame(data)
    df.loc[-1] = [wfile, 'Test Case ID']
    df.index = df.index + 1 # shifting index
    df = df.sort_index() # sorting by index
    df1 = df.rename(index=str, columns={"Test Case ID": "TC Attributes"})
    df21 = df1.drop(columns = ['TC Attributes'])
    df3 = df21.T
    # read the existing sheets so that openpyxl won't create a new one later
    book = load_workbook('test.xlsx')
    writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
    writer.book = book
    writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
    df3.to_excel(writer, 'sheet7', header = False, index = False, \
                 startrow = pos)
    pos += 1
    writer.save()
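One thing worth checking in the loop above: files[:1] is a slice containing only the first element, so the loop body runs once regardless of how many files glob returns, and printing tuple(files) shows all four names either way. A minimal illustration with hypothetical file names:

```python
files = ["a.docx", "b.docx", "c.docx", "d.docx"]  # hypothetical stand-in for glob('*.docx')

first_only = [name for i, name in enumerate(files[:1])]  # the slice keeps just the first item
all_files = [name for i, name in enumerate(files)]       # iterates over every file

print(first_only)
print(all_files)
```

Iterating over files directly would be the first change to try; whether that alone fixes the output depends on the rest of the script.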
import xlrd
workbook = xlrd.open_workbook(filename)
sheet = workbook.sheet_by_index(0)
array = []
for i in range(2, 9):
    array.append([sheet.cell(i, j).value for j in range(2, 5)])
Excel Image
I have this code and it runs, but it's not doing what I want. It pulls the data from all three columns of that Excel file (see Excel image). I only want to pull data from column C and column E, and store each pair in the array. How can I do that? I know Python has ways to skip columns and rows, but I'm not sure how to work that into the code I have.
Using openpyxl :-
from openpyxl import load_workbook
def iter_rows(ws):
    result=[]
    for row in ws.iter_rows():
        rowlist = []
        for cell in row:
            rowlist.append(cell.value)
        result.append(rowlist)
    return result
wb = load_workbook(filename = '/home/piyush/testtest.xlsx')
ws = wb.active
first_sheet = wb.sheetnames[0]
print(first_sheet)
worksheet = wb[first_sheet]
fileList = (list(iter_rows(worksheet)))
col1 = []
col2 = []
for col in fileList:
    col1.append(col[1])#1 is column index
    col2.append(col[2])#2 is column index
for a in zip(col1,col2):
    print(a)
    #append as pair in another array
using pandas:-
import pandas as pd
xl = pd.ExcelFile("/home/piyush/testtest.xlsx")
df = xl.parse("Sheet1")
df.iloc[:,[col1Index,col2Index]]
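For the original xlrd loop, the inner range(2, 5) visits column indices 2, 3 and 4 (C, D and E); iterating over just (2, 4) yields the C/E pairs. A minimal sketch with a nested list standing in for the sheet; the grid values and the cell helper are hypothetical, and with xlrd the lookup would be sheet.cell(i, j).value.

```python
# Hypothetical 3x5 grid standing in for the worksheet (row index, column index)
grid = [
    ["a0", "b0", "c0", "d0", "e0"],
    ["a1", "b1", "c1", "d1", "e1"],
    ["a2", "b2", "c2", "d2", "e2"],
]


def cell(i, j):
    """Stand-in for sheet.cell(i, j).value."""
    return grid[i][j]


# Take only column indices 2 and 4 (C and E) for rows 1 and 2
pairs = [[cell(i, j) for j in (2, 4)] for i in range(1, 3)]
print(pairs)
```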