let me describe my issue below:
I've got got two excel worksheets, one containing past, the other - current data. They both have the following structure:
Col_1
Col_2
KEY
Col_3
Etc.
abc
xyz
key_1
foo
---
def
zyx
key_2
bar
---
Now, the goal is to check if a value for given key changed between the past and current iteration and if yes, color the given cell's background (in current data worksheet). This check has to be done for all the columns.
As the KEY column is not the very first one, I've decided to use XLOOKUP function and apply the formatting within the for loop. The full loop looks like this (in this example the KEY column is column C):
dark_blue = writer.book.add_format({'bg_color': '#3A67B8'})
old_sheet = "\'" + "old_" + "sheet_name" + "\'"
for col in range(last_col):
col_name = xl_col_to_name(col)
if col_name in unformatted_cols: # Not apply the formatting to certain columns
continue
else:
apply_range = '{0}1:{0}1048576'.format(col_name)
formula = "XLOOKUP(C1, {1}!C1:C1048576, {1}!{0}1:{0}1048576) <> XLOOKUP(C1, C1:C1048576, {0}1:{0}1048576)".format(col_name, old_sheet)
active_sheet.conditional_format(apply_range, {'type': 'formula',
'criteria': formula,
'format': dark_blue})
Now, my problem is that when I open the output the this conditional formatting doesn't work. If however I'll go to Conditional Formatting -> Manage Rules -> Edit Rule and without any editing I'll press OK and later apply it starts working correctly.
Does anyone know how to make this rule work properly without this manual intervention?
My all other conditional formatting rules, though simpler, work exactly as intended.
# This is the formula that I see in Python for the first loop iteration
=XLOOKUP(C1, 'old_sheet_name'!C1:C1048576, 'old_sheet_name'!A1:A1048576) <> XLOOKUP(C1, C1:C1048576, A1:A1048576)
# This formula I see in Excel for the same first column
=XLOOKUP(C1, 'old_sheet_name'!C:C, 'old_sheet_name'!A:A) <> XLOOKUP(C1, C:C, A:A)
The reason that XLOOKUPdoesn't work in your formula is that it is classified by Excel as a "Future Function", i.e, a function added after the original file format. In order to use it you need to prefix it with _xlfn.
This is explained in the XlsxWriter docs on Formulas added in Excel 2010 and later.
Here is a working example:
import xlsxwriter
workbook = xlsxwriter.Workbook('conditional_format.xlsx')
worksheet1 = workbook.add_worksheet('old_sheet_name')
worksheet2 = workbook.add_worksheet('new_sheet_name')
worksheet1.write(0, 0, 'Foo')
format1 = workbook.add_format({'bg_color': '#C6EFCE',
'font_color': '#006100'})
xlookup_formula = '=_xlfn.XLOOKUP(C1, old_sheet_name!C:C, old_sheet_name!A:A) <> _xlfn.XLOOKUP(C1, C:C, A:A)'
worksheet2.conditional_format('D1:D10',
{'type': 'formula',
'criteria': xlookup_formula,
'format': format1})
workbook.close()
Output:
Related
I am writing a Python script using gspread to read and write data to a Google Sheet. One of the tasks that this script has to do is set the background color of any empty cell in a given column to black. I am attempting to read the sheet data just once into a dictionary called data using data = sheet.get_all_records(), then formatting the cell based on the dictionary's data, however I am running into a problem. worksheet.format() seems to only accept cells in 'A1' notation (that's all that it shows in the docs about formatting cells, and any attempts at other formats throws errors), but the cell object that I have by calling sheet.find(userDict[column]), where userDict[] is an element of data, returns its row and column as an integer. Is there a nice way to have a cell object return its position in 'A1' notation? If not, I suspect the easiest way would be to convert cell.col to the appropriate character using ASCII values, but I'm hoping there's a more elegant way.
Thanks in advance!
I believe your goal is as follows.
For the specific column, you want to set the background color of the cells which are empty.
You want to achieve this using gspread for python.
In this case, how about the following flow?
Retrieve all values from the sheet.
Retrieve the row numbers of the cells which are empty.
Set the background color of the cells using batchUpdate method.
When this flow is reflected in the script, it becomes as follows.
Sample script:
client = gspread.authorize(credentials) # Please use your authorization script.
spreadsheetId = "###" # Please set your Spreadsheet ID.
sheetName = "Sheet1" # Please set the sheet name.
checkColumns = [1, 3] # Please set the column number. In this sample, 1 and 3 means columns "A" and "C", respectively.
# 1. Retrieve all values from the sheet.
spreadsheet = client.open_by_key(spreadsheetId)
sheet = spreadsheet.worksheet(sheetName)
data = sheet.get_all_values()
# 2. Retrieve the row numbers of the cells which are the empty.
transposed = [list(e) for e in zip(*data)]
rowNumbers = [[i for i, v in enumerate(transposed[e - 1]) if not v] for e in checkColumns]
# 3. Set background color of the cells using batchUpdate method.
requests = []
for i, e in enumerate(checkColumns):
if rowNumbers[i] != []:
for f in rowNumbers[i]:
requests.append({
"updateCells": {
"range": {"sheetId": sheet.id, "startRowIndex": f, "endRowIndex": f + 1, "startColumnIndex": e - 1, "endColumnIndex": e},
"rows": [{"values": [{"userEnteredFormat": {"backgroundColor": {"red": 0, "green": 0, "blue": 0}}}]}],
"fields": "userEnteredFormat.backgroundColor"
}
})
res = spreadsheet.batch_update({"requests": requests})
When this script is run, the background color of the empty cells of the specific columns is set as the black color.
References:
batch_update(body)
UpdateCellsRequest
I have about 10 columns of data in a CSV file that I want to get statistics on using python. I am currently using the import csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if(category!=label):
difference+=1
totalChecked+=1
else:
correct+=1
totalChecked+=1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
convert both to pandas dataframes and compare similarly as this example. Whatever dataset your working on using the Pandas module, alongside any other necessary relevant modules, and transforming the data into lists and dataframes, would be first step to working with it imo.
I've taken the liberty and time/ effort to delve into this myself as it will be useful to me going forward. Columns don't have to have the same lengths at all in his example, so that's good. I've tested the below code (Python 3.8) and it works successfully.
With only a slight adaptations can be used for your specific data columns, objects and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S fom _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
for species, seq1 in Ref_dict.items():
m = re.search(seq, seq1)
if m:
match = m.group()
pos = m.start() + 1
f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contained integers, and according to your specifications (As best at the moment I can). Its my first try [Its my first attempt without webscraping, so go easy]. You could use my code below for a benchmark of how to move forward on your question.
Basically it does what you want (give you the skeleton) and does this : "imports csv in python using pandas module, converts to dataframes, works on specific columns only in those df's, make new columns (results), prints results alongside the original data in the terminal, and saves to new csv. It's as as messy as my python is , but it works! personally (& professionally) speaking is a milestone for me and I Will hopefully be working on it at a later date to improve it readability, scope, functionality and abilities [as the days go by (from next weekend).]
# This is work in progress, (although it does work and does a job), and its doing that for you. there are redundant lines of code in it, even the lines not hashed out (because im a self teaching newbie on my weekends). I was just finishing up on getting the results printed to a new csv file (done too). You can see how you could convert your columns & rows into lists with pandas dataframes, and start to do calculations with them in Python, and get your results back out to a new CSV. It a start on how you can answer your question going forward
#ITS FOR HER TO DO MUCH MORE & BETTER ON!! BUT IT DOES IN BASIC TERMS WHAT SHE ASKED FOR.
import pandas as pd
from pandas import DataFrame
import csv
import itertools #redundant now'?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats) #good to compare whole columns as block, Enumerate ZIP 19:06 01/06/2020 FORGET THIS FOR NOW, WAS PART OF A LATTER ATTEMPT TO COMPARE TEXT & MISSED TEXT WITH INTERGER FIELDS. DOESNT EFFECT PROGRAM
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re #this is for string matching & comparison. redundant in my example here but youll need it to compare tables if strings.
#filename = 'CATS&LABS64.csv' # when i got to exporting part, this is redundant now
#csvfile = open(filename, 'a') #when i tried to export results/output it first time - redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
#YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULTAIONS YOURSELF HERE AND ADD THEM TO THE OUTPUT
result = (print("\nDifference of score1 and score2 :\n", A))
result2 = print(A) and print(result)
def result22(result2):
for aSentence in result2:
df = pd.DataFrame(result2)
print(str())
return df
print(result2)
print(result22) # printing out the function itself 'produces nothing but its name of course
output_df = DataFrame((result2),A)
output_df.to_csv('some_name5523.csv')
Yes, i know, its by no means perfect At all, but wanted to give you the heads up about panda's and dataframes for doing what you want moving forward.
preface: I'm new and self taught. This is my first coding project. I know it's terrible. I'm going to rewrite it once it's complete and working.
I'm trying to write a python script that will compare 2 excel files and highlight the cells that are different. I can print out the differences (using pandas) and highlight a cell (only by hard coding a specific cell). I can't figure out how to highlight cells based on the printed out differences.
df1 = pd.read_excel(mxln) # Loads master xlsx for comparison
df2 = pd.read_excel(sfcn) # Loads student xlsx for comparison
print('If NaN, correct. If not NaN, incorrect')
difference = df2[df2 != df1] # Scans for differences
print(difference)
lmfh = load_workbook(mxln) # Load mxln for highlight
lsfh = load_workbook(sfcn) # Load sfcn for highlight
lmws = lmfh.active
lsws = lsfh.active
redFill = PatternFill(start_color='FFEE1111', end_color='FFEE1111', fill_type='solid')
lsws['A1'].fill = redFill # Hardcoded cell color
lsfh.save(sfcn)
This is only the part of the code I'm stuck on. I can post the rest if necessary.
You can use the style to add highlighting to your dataframe in pandas.
df2.style.apply(highlight_differences)
Then you can write a function which sets the highlighting criteria
def highlight_differences():
# check for differences here
return ['background-color: yellow']
I tried to set some groups in xlsxwriter but it seems I can't get the + Symbol at the top of my group. As it is setting the grouping row-wise I tried to write a function that takes the workbook, the worksheet index and a start/end as well as the level. But whatever I do the grouping symbol in excel never appears in the first row. Funny thing is, if I use something like start_row + 5 for the collapsing row I will get a second + at the right row, but still another one at the end. Does anyone know if its possible?
The examples show grouping just for the last row of a group.
def set_group(out_wb, ws_index, start_row, end_row, level):
#added as i used an offset
start_row = start_row
end_row = end_row
out_wb.worksheets()[ws_index].set_row(start_row, None, None, {'level': level, 'collapsed': True})
for i in range(start_row + 1, end_row):
out_wb.worksheets()[ws_index].set_row(i, None, None, {'level': level, 'hidden': True})
return out_wb
You can use the outline_settings() worksheet method and set symbols_below to False:
worksheet.outline_settings(symbols_below=False)
I'm with xlsxwriter in version 0.5.5 and the command line is
worksheet.outline_settings(outline_below=False)
I'm trying to set conditional formatting in openpyxl to emulate highlighting duplicate values. With this simple code, I should be able to highlight consecutive duplicates (but not the first value in a duplicate sequence).
from pandas import *
data = DataFrame({'a':'a a a b b b c b c a f'.split()})
wb = ExcelWriter('test.xlsx')
data.to_excel(wb)
ws = wb.sheets['Sheet1']
from openpyxl.style import Color, Fill
# Create fill
redFill = Fill()
redFill.start_color.index = 'FFEE1111'
redFill.end_color.index = 'FFEE1111'
redFill.fill_type = Fill.FILL_SOLID
ws.conditional_formatting.addCellIs("B1:B1048576", 'equal', "=R[1]C", True, wb.book, None, None, redFill)
wb.save()
However, when I open it in Excel I get an error related to conditional formatting, and the data is not highlighted as expected. Is openpyxl able to handle R1C1 style referencing?
In regards to highlighting to find duplicates of sequential values, the formula you want is
=AND(B1<>"",B2=B1)
With a range starting from B2 (aka, B2:B1048576)
Note - this appears to be broken in the current 1.8.3 branch of openpyxl, but will be fixed shortly in the 1.9 branch.
from openpyxl import Workbook
from openpyxl.style import Color, Fill
wb = Workbook()
ws = wb.active
ws['B1'] = 1
ws['B2'] = 2
ws['B3'] = 3
ws['B4'] = 3
ws['B5'] = 7
ws['B6'] = 4
ws['B7'] = 7
# Create fill
redFill = Fill()
redFill.start_color.index = 'FFEE1111'
redFill.end_color.index = 'FFEE1111'
redFill.fill_type = Fill.FILL_SOLID
dxfId = ws.conditional_formatting.addDxfStyle(wb, None, None, redFill)
ws.conditional_formatting.addCustomRule('B2:B1048576',
{'type': 'expression', 'dxfId': dxfId, 'formula': ['AND(B1<>"",B2=B1)']})
wb.save('test.xlsx')
As a further reference:
If you want to highlight all duplicates:
COUNTIF(B:B,B1)>1
If you want to highlight all duplicates except for the first occurence:
COUNTIF($B$2:$B2,B2)>1
If you to highlight sequential duplicates, except for the last one:
COUNTIF(B1:B2,B2)>1
Regarding RC notation - while openpyxl doesn't support excel RC notation, conditional formatting will write the formula as provided. Unfortunately, excel enables R1C1 notation only superficially as a flag, and converts all the formulas back to their A1 equivalent when saving, meaning you'd need a function to convert all R1C1 functions to their A1 equivalents for this to work.
Openpyxl doesn't support Excel RC notation.
You could use A1 notation instead which would mean that the equivalent formula is =B2 (I think).
However, you should verify that it actually works in Excel first.
My feeling is that it won't. In general conditional formatting uses absolute cell references $B$2 instead of relative cell references B1.
If it does work then convert your formula to A1 notation and that should work in Openpyxl.
You can't use R1C1 notation directly, and this answer would be a terrible way to format a range of cells, but OpenPyXL does allow you to use row and column numbers.
cell = ws.cell(r, c)
returns the worksheet cell at row r and column c, creating one if needed. Unlike the old xlrd/xlwt modules, row and column indices begin at 1, so you can read r and c directly off of a spreadsheet using the R1C1 reference style. For most purposes, you want to access .value, for example:
ws.cell(2, 3).value = 3
...
v = ws.cell(4, 5).value
It's not nearly as pretty as ws['R2C3'] = 3 or v = ws['R4C5'], but it helps with simple tasks.