preface: I'm new and self taught. This is my first coding project. I know it's terrible. I'm going to rewrite it once it's complete and working.
I'm trying to write a python script that will compare 2 excel files and highlight the cells that are different. I can print out the differences (using pandas) and highlight a cell (only by hard coding a specific cell). I can't figure out how to highlight cells based on the printed out differences.
df1 = pd.read_excel(mxln) # Loads master xlsx for comparison
df2 = pd.read_excel(sfcn) # Loads student xlsx for comparison
print('If NaN, correct. If not NaN, incorrect')
difference = df2[df2 != df1] # Scans for differences
print(difference)
lmfh = load_workbook(mxln) # Load mxln for highlight
lsfh = load_workbook(sfcn) # Load sfcn for highlight
lmws = lmfh.active
lsws = lsfh.active
redFill = PatternFill(start_color='FFEE1111', end_color='FFEE1111', fill_type='solid')
lsws['A1'].fill = redFill # Hardcoded cell color
lsfh.save(sfcn)
This is only the part of the code I'm stuck on. I can post the rest if necessary.
You can use the style to add highlighting to your dataframe in pandas.
df2.style.apply(highlight_differences)
Then you can write a function which sets the highlighting criteria
def highlight_differences():
# check for differences here
return ['background-color: yellow']
Related
let me describe my issue below:
I've got got two excel worksheets, one containing past, the other - current data. They both have the following structure:
Col_1
Col_2
KEY
Col_3
Etc.
abc
xyz
key_1
foo
---
def
zyx
key_2
bar
---
Now, the goal is to check if a value for given key changed between the past and current iteration and if yes, color the given cell's background (in current data worksheet). This check has to be done for all the columns.
As the KEY column is not the very first one, I've decided to use XLOOKUP function and apply the formatting within the for loop. The full loop looks like this (in this example the KEY column is column C):
dark_blue = writer.book.add_format({'bg_color': '#3A67B8'})
old_sheet = "\'" + "old_" + "sheet_name" + "\'"
for col in range(last_col):
col_name = xl_col_to_name(col)
if col_name in unformatted_cols: # Not apply the formatting to certain columns
continue
else:
apply_range = '{0}1:{0}1048576'.format(col_name)
formula = "XLOOKUP(C1, {1}!C1:C1048576, {1}!{0}1:{0}1048576) <> XLOOKUP(C1, C1:C1048576, {0}1:{0}1048576)".format(col_name, old_sheet)
active_sheet.conditional_format(apply_range, {'type': 'formula',
'criteria': formula,
'format': dark_blue})
Now, my problem is that when I open the output the this conditional formatting doesn't work. If however I'll go to Conditional Formatting -> Manage Rules -> Edit Rule and without any editing I'll press OK and later apply it starts working correctly.
Does anyone know how to make this rule work properly without this manual intervention?
My all other conditional formatting rules, though simpler, work exactly as intended.
# This is the formula that I see in Python for the first loop iteration
=XLOOKUP(C1, 'old_sheet_name'!C1:C1048576, 'old_sheet_name'!A1:A1048576) <> XLOOKUP(C1, C1:C1048576, A1:A1048576)
# This formula I see in Excel for the same first column
=XLOOKUP(C1, 'old_sheet_name'!C:C, 'old_sheet_name'!A:A) <> XLOOKUP(C1, C:C, A:A)
The reason that XLOOKUPdoesn't work in your formula is that it is classified by Excel as a "Future Function", i.e, a function added after the original file format. In order to use it you need to prefix it with _xlfn.
This is explained in the XlsxWriter docs on Formulas added in Excel 2010 and later.
Here is a working example:
import xlsxwriter
workbook = xlsxwriter.Workbook('conditional_format.xlsx')
worksheet1 = workbook.add_worksheet('old_sheet_name')
worksheet2 = workbook.add_worksheet('new_sheet_name')
worksheet1.write(0, 0, 'Foo')
format1 = workbook.add_format({'bg_color': '#C6EFCE',
'font_color': '#006100'})
xlookup_formula = '=_xlfn.XLOOKUP(C1, old_sheet_name!C:C, old_sheet_name!A:A) <> _xlfn.XLOOKUP(C1, C:C, A:A)'
worksheet2.conditional_format('D1:D10',
{'type': 'formula',
'criteria': xlookup_formula,
'format': format1})
workbook.close()
Output:
I have this excel file, how to search cell have "-->" character then fill color for it
I try research but all are solution for DataFrame is available
In order to check the presence of any character in the string, use the SEARCH function in the ISNUMBER function.
=ISNUMBER(SEARCH("-->", E2))
This will return True if E2 contains "-->".
And then you can fill color in excel sheet's cell.
After that use PatternFill to color cell as follows:
pattern_fill = PatternFill(patterntype = 'solid', fgcolor = 'C64747')
worksheet['E2'].fill = pattern_fill
patterntype, fgcolor data can be used as per requirement.
I have about 10 columns of data in a CSV file that I want to get statistics on using python. I am currently using the import csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if(category!=label):
difference+=1
totalChecked+=1
else:
correct+=1
totalChecked+=1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
convert both to pandas dataframes and compare similarly as this example. Whatever dataset your working on using the Pandas module, alongside any other necessary relevant modules, and transforming the data into lists and dataframes, would be first step to working with it imo.
I've taken the liberty and time/ effort to delve into this myself as it will be useful to me going forward. Columns don't have to have the same lengths at all in his example, so that's good. I've tested the below code (Python 3.8) and it works successfully.
With only a slight adaptations can be used for your specific data columns, objects and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S fom _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
for species, seq1 in Ref_dict.items():
m = re.search(seq, seq1)
if m:
match = m.group()
pos = m.start() + 1
f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contained integers, and according to your specifications (As best at the moment I can). Its my first try [Its my first attempt without webscraping, so go easy]. You could use my code below for a benchmark of how to move forward on your question.
Basically it does what you want (give you the skeleton) and does this : "imports csv in python using pandas module, converts to dataframes, works on specific columns only in those df's, make new columns (results), prints results alongside the original data in the terminal, and saves to new csv. It's as as messy as my python is , but it works! personally (& professionally) speaking is a milestone for me and I Will hopefully be working on it at a later date to improve it readability, scope, functionality and abilities [as the days go by (from next weekend).]
# This is work in progress, (although it does work and does a job), and its doing that for you. there are redundant lines of code in it, even the lines not hashed out (because im a self teaching newbie on my weekends). I was just finishing up on getting the results printed to a new csv file (done too). You can see how you could convert your columns & rows into lists with pandas dataframes, and start to do calculations with them in Python, and get your results back out to a new CSV. It a start on how you can answer your question going forward
#ITS FOR HER TO DO MUCH MORE & BETTER ON!! BUT IT DOES IN BASIC TERMS WHAT SHE ASKED FOR.
import pandas as pd
from pandas import DataFrame
import csv
import itertools #redundant now'?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats) #good to compare whole columns as block, Enumerate ZIP 19:06 01/06/2020 FORGET THIS FOR NOW, WAS PART OF A LATTER ATTEMPT TO COMPARE TEXT & MISSED TEXT WITH INTERGER FIELDS. DOESNT EFFECT PROGRAM
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re #this is for string matching & comparison. redundant in my example here but youll need it to compare tables if strings.
#filename = 'CATS&LABS64.csv' # when i got to exporting part, this is redundant now
#csvfile = open(filename, 'a') #when i tried to export results/output it first time - redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
#YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULTAIONS YOURSELF HERE AND ADD THEM TO THE OUTPUT
result = (print("\nDifference of score1 and score2 :\n", A))
result2 = print(A) and print(result)
def result22(result2):
for aSentence in result2:
df = pd.DataFrame(result2)
print(str())
return df
print(result2)
print(result22) # printing out the function itself 'produces nothing but its name of course
output_df = DataFrame((result2),A)
output_df.to_csv('some_name5523.csv')
Yes, i know, its by no means perfect At all, but wanted to give you the heads up about panda's and dataframes for doing what you want moving forward.
I'm trying to display a correlation heatmap which has 100 columns. I was investigating some options that allow one to do the Excel-equivalent of "Freeze pane", but all of them that I've looked into drop the styling. How do I freeze the index/headers and also include styling?
corr = numeric_data.dropna(how='all').corr()
def condense_text(text,pos):
text = list(text)
for i in range(pos,len(text),pos):
text.insert(i, '\n')
return "".join(text)
corr.columns = [condense_text(c,8) for c in corr.columns.tolist()]
corr.style.background_gradient(cmap='coolwarm').set_precision(2)
I'm trying to set conditional formatting in openpyxl to emulate highlighting duplicate values. With this simple code, I should be able to highlight consecutive duplicates (but not the first value in a duplicate sequence).
from pandas import *
data = DataFrame({'a':'a a a b b b c b c a f'.split()})
wb = ExcelWriter('test.xlsx')
data.to_excel(wb)
ws = wb.sheets['Sheet1']
from openpyxl.style import Color, Fill
# Create fill
redFill = Fill()
redFill.start_color.index = 'FFEE1111'
redFill.end_color.index = 'FFEE1111'
redFill.fill_type = Fill.FILL_SOLID
ws.conditional_formatting.addCellIs("B1:B1048576", 'equal', "=R[1]C", True, wb.book, None, None, redFill)
wb.save()
However, when I open it in Excel I get an error related to conditional formatting, and the data is not highlighted as expected. Is openpyxl able to handle R1C1 style referencing?
In regards to highlighting to find duplicates of sequential values, the formula you want is
=AND(B1<>"",B2=B1)
With a range starting from B2 (aka, B2:B1048576)
Note - this appears to be broken in the current 1.8.3 branch of openpyxl, but will be fixed shortly in the 1.9 branch.
from openpyxl import Workbook
from openpyxl.style import Color, Fill
wb = Workbook()
ws = wb.active
ws['B1'] = 1
ws['B2'] = 2
ws['B3'] = 3
ws['B4'] = 3
ws['B5'] = 7
ws['B6'] = 4
ws['B7'] = 7
# Create fill
redFill = Fill()
redFill.start_color.index = 'FFEE1111'
redFill.end_color.index = 'FFEE1111'
redFill.fill_type = Fill.FILL_SOLID
dxfId = ws.conditional_formatting.addDxfStyle(wb, None, None, redFill)
ws.conditional_formatting.addCustomRule('B2:B1048576',
{'type': 'expression', 'dxfId': dxfId, 'formula': ['AND(B1<>"",B2=B1)']})
wb.save('test.xlsx')
As a further reference:
If you want to highlight all duplicates:
COUNTIF(B:B,B1)>1
If you want to highlight all duplicates except for the first occurence:
COUNTIF($B$2:$B2,B2)>1
If you to highlight sequential duplicates, except for the last one:
COUNTIF(B1:B2,B2)>1
Regarding RC notation - while openpyxl doesn't support excel RC notation, conditional formatting will write the formula as provided. Unfortunately, excel enables R1C1 notation only superficially as a flag, and converts all the formulas back to their A1 equivalent when saving, meaning you'd need a function to convert all R1C1 functions to their A1 equivalents for this to work.
Openpyxl doesn't support Excel RC notation.
You could use A1 notation instead which would mean that the equivalent formula is =B2 (I think).
However, you should verify that it actually works in Excel first.
My feeling is that it won't. In general conditional formatting uses absolute cell references $B$2 instead of relative cell references B1.
If it does work then convert your formula to A1 notation and that should work in Openpyxl.
You can't use R1C1 notation directly, and this answer would be a terrible way to format a range of cells, but OpenPyXL does allow you to use row and column numbers.
cell = ws.cell(r, c)
returns the worksheet cell at row r and column c, creating one if needed. Unlike the old xlrd/xlwt modules, row and column indices begin at 1, so you can read r and c directly off of a spreadsheet using the R1C1 reference style. For most purposes, you want to access .value, for example:
ws.cell(2, 3).value = 3
...
v = ws.cell(4, 5).value
It's not nearly as pretty as ws['R2C3'] = 3 or v = ws['R4C5'], but it helps with simple tasks.