Pandas update excel file with formatting - python

I have an excel file which has multiple sheets and special formatting (colors, symbols, etc...).
In Python, I know that I can read the file into a DataFrame, update certain columns, and then write the file back, but doing so loses all formatting and overwrites the file.
Is there a way to open the file and just update the values of certain columns, keeping other sheets untouched and formatting as is?

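Yes, you can use openpyxl, for example:
from openpyxl import load_workbook
# Raw strings keep the backslash in the Windows path from being read as an escape
wb = load_workbook(filename=r'mypath\myfile.xlsx')
ws = wb.worksheets[0]
# Assigning to .value changes only the content; the cell's style is untouched
ws["A1"].value = 2
wb.save(r"mypath\myfile.xlsx")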
Here, cell A1 has a particular format. The format stays the same and only the value of the cell changes.
To read the value of the cell, you can use this:
ws.cell(row=row_number, column=column_number).value
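To change the values of a column with a for loop, this is an option:
new_data = ['a', 'b', 'c', 'd']
# zip stops at the shorter sequence, so a column with more cells than new_data does not raise IndexError
for cell, value in zip(ws['A'], new_data):
    cell.value = value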
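As a minimal sketch of the same idea, the snippet below writes new values down one column and checks that the cell's font is unchanged afterwards. The helper name `update_column`, the `start_row` parameter, and the sample data are hypothetical and only for illustration; an in-memory workbook stands in for the real formatted file.

```python
from openpyxl import Workbook
from openpyxl.styles import Font


def update_column(ws, column_letter, values, start_row=2):
    """Write values down one column, starting at start_row (row 1 is assumed to be a header)."""
    for offset, value in enumerate(values):
        ws[f"{column_letter}{start_row + offset}"].value = value


# Build a small in-memory workbook that stands in for the formatted file.
wb = Workbook()
ws = wb.active
ws.append(["name", "score"])
ws.append(["alice", 1])
ws.append(["bob", 2])
ws["B2"].font = Font(bold=True)  # pretend this cell carries special formatting

update_column(ws, "B", [10, 20])

print(ws["B2"].value, ws["B2"].font.bold)  # prints: 10 True
```

With a real file, the workbook would come from load_workbook and be written back with wb.save, as in the answer above.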

Related

Treat everything as raw string (even formulas) when reading into pandas from excel

I am handling text responses from surveys, and it is common to have responses that start with -, for example: -I am sad today.
Excel interprets such a cell as a formula and shows #NAME?
So when I import the Excel file into pandas using read_excel, the cell shows up as NaN.
Is there any way to force Excel to keep these cells as raw strings instead of interpreting them as formulas?
I wrote a VBA macro that sets the entire column to text and clicks through all the cells in the column, which is slow when there are more than ten thousand rows.
I was hoping to do this in Python instead. Any ideas?
I hope this works for you: use openpyxl to extract the Excel data and then convert it into a pandas DataFrame.
from openpyxl import load_workbook
import pandas as pd
# With the default data_only=False, formula cells are returned as their raw text
ws = load_workbook(filename='./formula_contains_raw.xlsx').active
rows = list(ws.values)  # ws.values is a generator of row tuples
print(rows)
# The first row holds the column names; the remaining rows are the data
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()
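To illustrate why this approach keeps the text, here is a small sketch that round-trips a workbook through memory. The sheet contents are made up for the example, and it relies on openpyxl treating a string that starts with "=" as a formula and returning formula cells as text when data_only is left at its default of False.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a tiny workbook in memory; the contents are invented for illustration.
wb = Workbook()
ws = wb.active
ws.append(["response"])
ws.append(["hello"])
ws.append(["=-I am sad today"])  # a string starting with "=" is stored as a formula

buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# data_only is left at its default (False), so formula cells come back as text
reloaded = load_workbook(buffer).active
responses = [row[0] for row in reloaded.iter_rows(min_row=2, values_only=True)]
print(responses)
```

The leading "=" is still part of each formula string, so it would need to be stripped before analysing the responses.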
It works for me using a CSV file instead of an Excel file.
With the CSV file opened in Excel, I select the option Formulas/Show Formulas and then save the file.
pd.read_csv('draft.csv')
Output:
Col1
0 hello
1 =-hello

Python: adding a column to one sheet from an excel file

I'm trying to add a single empty column to one sheet of an Excel file. The file has a specific structure that I can't change, and the column immediately after the insertion point is very narrow. The code below will not insert the column between a standard-width column and that narrow column. But when I adjust the index so that the insertion point is between two standard-width columns, there is no issue.
How can I fix my code to not have this issue inserting a column or are there better methods?
from openpyxl import load_workbook
workbook = load_workbook('file.xlsx')
sheet = workbook.worksheets[8]
sheet.insert_cols(185)
workbook.save(filename= 'file.xlsx')

How to append data to the last row (every time) of an Excel file?

I am looking for a way to append data from a Python program to an excel sheet. For this, I chose the openpyxl library to save this data.
My problem is how to add new data after the last row of the sheet without losing the existing data. I looked through the documentation but did not find an answer.
I do not know whether the library has a method for adding new data or whether I need to write the logic for this myself.
The last row of the sheet can be found using the max_row property:
from openpyxl import load_workbook
myFileName=r'C:\DemoFile.xlsx'
#load the workbook, and put the sheet into a variable
wb = load_workbook(filename=myFileName)
ws = wb['Sheet1']
#max_row is a worksheet property that gives the number of the last row in use.
newRowLocation = ws.max_row +1
#write to the cell you want, specifying row and column, and value :-)
ws.cell(column=1,row=newRowLocation, value="aha! a new entry at the end")
wb.save(filename=myFileName)
wb.close()
What you're looking for is the Worksheet.append method:
Appends a group of values at the bottom of the current sheet.
If it’s a list: all values are added in order, starting from the first column
If it’s a dict: values are assigned to the columns indicated by the keys (numbers or letters)
So no need to check for the last row. Just use this method to always add the data at the end.
ws.append(["some", "test", "data"])
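A short sketch of both forms described in the quoted documentation, using an in-memory workbook with made-up data so that it runs without an existing file:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

ws.append(["name", "score"])      # list: values fill columns A, B in order
ws.append(["alice", 90])
ws.append({"A": "bob", "B": 85})  # dict keyed by column letter
ws.append({1: "carol", 2: 70})    # dict keyed by 1-based column number

rows = list(ws.values)
print(ws.max_row)  # prints: 4
```

Each call lands on the row after the last one written, so there is no need to track max_row by hand.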

Reading merged cells from Excel File & store them within dataframe (Using Python)

I have a table in an excel that contains many merged cells.
Given Excel Table
When I open it with python, I want to be able to turn it into a dataframe while maintaining data integrity. It should look like this:
Final Table
I've searched through and referred to a few of the different solutions here:
How to read merged cells in python using openpyxl?
How to read merged Excel cells with NaN into Pandas DataFrame
But none have worked.
I understand that fillna will also fill cells that were not originally merged. I also read a little about merged_cells but am still unable to make it work.
from openpyxl import load_workbook
wb = load_workbook('file')
sheet = wb.active
mcells = sheet.merged_cells.ranges
# mcells holds the merged cell ranges
# I want to go through the merged ranges and use sheet.unmerge_cells() to unmerge them and fill the cells accordingly.
# This way I can ensure cells that are not merged will remain blank.
for merged_range in mcells:
    print(merged_range)
# How do I continue from here?
Is there a way to convert the merged cells accurately?
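One possible way to continue, sketched under the assumption that every cell of a merged range should receive the value of its top-left cell. The function name `fill_merged_cells` and the sample sheet are hypothetical; the sketch copies the list of ranges first because unmerging changes it during the loop.

```python
from openpyxl import Workbook


def fill_merged_cells(ws):
    """Unmerge every merged range and copy its top-left value into each cell of the range."""
    for merged in list(ws.merged_cells.ranges):  # copy: unmerging mutates the collection
        min_col, min_row, max_col, max_row = merged.bounds
        value = ws.cell(row=min_row, column=min_col).value
        ws.unmerge_cells(str(merged))
        for row in ws.iter_rows(min_row=min_row, max_row=max_row,
                                min_col=min_col, max_col=max_col):
            for cell in row:
                cell.value = value


# Made-up sheet: A1:A3 is merged, column B has one genuinely blank cell.
wb = Workbook()
ws = wb.active
ws["A1"] = "group"
ws["B1"] = "x"
ws["B3"] = "z"
ws.merge_cells("A1:A3")

fill_merged_cells(ws)
column_a = [ws.cell(row=r, column=1).value for r in (1, 2, 3)]
print(column_a, ws["B2"].value)
```

Cells that were never merged, such as B2 here, are left blank, which is the difference from a blanket fillna. The filled sheet could then be turned into a DataFrame from ws.values.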

pandas read excel values not formulas

Is there a way to have pandas read in only the values from Excel and not the formulas? It reads the formulas in as NaN unless I go in and manually save the Excel file before running the code. I am just working with the basic read_excel function of pandas:
import pandas as pd
df = pd.read_excel(filename, sheet_name="Sheet1")
This reads the values if I have opened and saved the file before running the code. But after running the code to update a new sheet, if I don't save the file manually and then run this again, it reads the formulas as NaN instead of the values. Is there a workaround that will read just the values from Excel with pandas?
That is strange. The normal behaviour of pandas is to read values, not formulas. The problem is likely in your Excel files: probably your formulas point to other files, or they return a value that pandas sees as NaN.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve it by setting explicit NaN values in read_excel:
pd.read_excel(path, sheet_name="Sheet1", na_values = [your na identifiers])
As for the first case, as a workaround to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl
def df_from_excel(path):
    # Open the workbook in a hidden Excel instance and save it, as you would by hand
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path)
df = df_from_excel(path)  # path: the path to your file
If you want to keep those formulas in your Excel file, just save the file to a different location (book.save(different location)). Then you can get rid of the temporary files with shutil.
I had this problem and resolved it by moving a graph below the first row I was reading. It looks like the position of graphs may cause problems.
You can use xlrd to read the values.
First you should refresh your Excel sheet if you are also updating the values automatically with Python. You can use the function below:
import os
import xlrd
import win32com.client
file = "myxl.xls"
def refresh_file(file):
    # Open the workbook in Excel, refresh it, wait for async queries to finish, then save
    xlapp = win32com.client.DispatchEx("Excel.Application")
    path = os.path.abspath(file)
    wb = xlapp.Workbooks.Open(path)
    wb.RefreshAll()
    xlapp.CalculateUntilAsyncQueriesDone()
    wb.Save()
    xlapp.Quit()
After the file refresh, you can start reading the content:
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
    row = worksheet.row(rowid)
    for colid, cell in enumerate(row):
        print(cell.value)
You can loop through the data however you need and add conditions while reading it, which gives you a lot more flexibility.
