Convert XLSX to CSV without losing values from formulas - python

I've tried a few methods, including pandas:
df = pd.read_excel('file.xlsx')
df.to_csv('file.csv')
But every time I convert my xlsx file over to csv format, I lose all data within columns that include a formula. I have a formula that concatenates values from two other cells + '#domain' to create user emails, but this entire column returns blank in the csv product.
The formula is basically this:
=CONCATENATE(B2,".",E2,"#domain")
The conversion is part of a larger code workflow, but it won't work if this column is left blank. The only thing I've tried that worked was this API, but I'd rather not pay a subscription if this can be done locally on the machine.
Any ideas? I'll try whatever you throw at me - bear in mind I'm new to this, but I will do my best!

You can try to open the excel file with the openpyxl library in the data-only mode. This will prevent the raw formulas - they are going to be calculated just the way you see them in excel itself.
import openpyxl
wb = openpyxl.load_workbook(filename, data_only=True)
Watch out when youre working with you original file and save it with the openpyxl-lib in the data-only-mode all your formulas will be lost. I had this once and it was horrible. So i recommend using a copy of your file to work with.
Since you have your xlsx-file with values only you can now use the internal csv library to generate a proper csv-file (idea from this post: How to save an Excel worksheet as CSV):
import csv
sheet = wb.active # was .get_active_sheet()
with open('test.csv', 'w', newline="") as f:
c = csv.writer(f)
for r in sheet.iter_rows(): # generator; was sh.rows
c.writerow([cell.value for cell in r])

Related

What is an efficient way to make excel report from python?

For the report I use this code:
from openpyxl import load_workbook
ReportName = "tets.xlsx"
new_row_data = [
["value", 'value2', "value3"]]
wb = load_workbook(ReportName)
# Select Second Worksheet
ws = wb.worksheets[1]
# Append 1 or 2(if multiple) new Rows - Columns A - D
for row_data in new_row_data:
# Append Row Values
ws.append(row_data)
wb.save(ReportName) #save the file
It has no errors, but I don't understand why sometimes it doesn't make the report inside the excel file, it saves the excel file, then when I open it, I don't see the values.
Do you know a better way or more solid way to make the auto report?
I would recommend using CSV files, that can be loaded into excel. Python is really great at working with CSVs.
Here is a link that might be useful.
[https://docs.python.org/3/library/csv.html][1]

Convert XLSX to CVS in Python: Keep Values, not Formulas

TLDR: How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves?
My code combines two .xlsx sheets together to generate emails for new org users.
The first .xlsx contains a formula that concatenates the user's name and our domain, while the other .xlsx contains the queried list of new users. When combined, the newly generated .xlsx, titled 'users.xlsx' includes the desired information - but the emails generated are done so using the formula, still - not values. If asked to read data_only via pandas, it doesn't seem to work at all and no emails are generated on this newly created 'users' xlsx sheet.
This is all fine and works well, but the final step is converting the .xlsx over to .csv
Because the emails are technically generated through the concatenating formula, the conversion doesn't preserve the user's emails.
How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves? Is this possible? Can I force the third .xlsx to preserve values only and then do the conversion?
Things I've tried (While they all successfully convert into a .cvs, the data within formulas is lost):
Lenged:
combined_xlsx_2
# The .xlsx product after combining two xlsx (user info + emails)
# This product is 'users.xlsx' - I need it converted to a .csv
Code 1:
# Read and store content
# of an excel file
read_file = pd.read_excel (combined_xlsx_2)
# Write the dataframe object
# into csv file
filedir = combined_xlsx_2.replace("users_2.xlsx","users.csv")
read_file.to_csv (filedir,
index = None,
header=True,
encoding='utf-8')
# read csv file and convert
# into a dataframe object
df = pd.DataFrame(pd.read_csv(filedir))
df
Code 2:
filename = (combined_xlsx_2)
filedir = (filename.replace("/users.xlsx",""))
path_to_excel_files = glob.glob(filedir)
for excel in path_to_excel_files:
out = excel.split('.')[0]+'.csv'
df = pd.read_excel(excel)
df.to_csv(out)
Code 3:
wb = xlrd.open_workbook(combined_xlsx_2)
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open(combined_xlsx_2.replace('.xlsx', '.csv'), 'w')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in range(sh.nrows):
wr.writerow(sh.row_values(rownum))
your_csv_file.close()
Thank you for your time and assistance!
UPDATE 1:
I was able to accomplish this using 'convert-api'
https://www.convertapi.com/xlsx-to-csv#snippet=python
While not what I had in mind, it will at least get me by. Still hoping there's a better solution for this. Just wanted to share this just in case anyone else had a similar question.

Python transfer excel formatting between two Excel documents

I'd like to copy the formatting between two Excel sheets in python.
Here is the situation:
I have a script that effectively "alters" (ie overwrites) an excel file by opening it using pd.ExcelWriter, then updates values in the rows. Finally, file is overwritten using ExcelWriter.
The Excel file is printed/shared/read by humans between updates done by the code. Humans will do things like change number formatting, turn on/off word wrap, and alter column widths.
My goal is the code updates should only alter the content of the file, not the formatting of the columns.
Is there a way I can read/store/write the sheet format within python so the output file has the same column formatting as the input file?
Here's the basic idea of what I am doing right now:
df_in= pd.read_excel("myfile.xlsx")
# Here is where I'd like to read in format of the first sheet of this file
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
# Here is where I'd like to apply the format I read earlier to the sheet
xlwriter.save()
Note: I have played with xlsxwriter.set_column and add_format. As far as I can tell, these don't help me read the format from the current file
Pandas uses xlrd package for parsing Excel documents to DataFrames.
Interoperability between other xlsx packages and xlrd could be problematic when it comes to the data structure used to represent formatting information.
I suggest using openpyxl as your engine when instantiating pandas.ExcelWriter. It comes with reader and writer classes that are interoperable.
import pandas as pd
from openpyxl.styles.stylesheet import apply_stylesheet
from openpyxl.reader.excel import ExcelReader
xlreader = ExcelReader('myfile.xlsx', read_only=True)
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='openpyxl')
df_in = pd.read_excel("myfile.xlsx")
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
apply_stylesheet(xlreader.archive, xlwriter.book)
xlwriter.save()

pandas read excel values not formulas

Is there a way to have pandas read in only the values from excel and not the formulas? It reads the formulas in as NaN unless I go in and manually save the excel file before running the code. I am just working with the basic read excel function of pandas,
import pandas as pd
df = pd.read_excel(filename, sheetname="Sheet1")
This will read the values if I have gone in and saved the file prior to running the code. But after running the code to update a new sheet, if I don't go in and save the file after doing that and try to run this again, it will read the formulas as NaN instead of just the values. Is there a work around that anyone knows of that will just read values from excel with pandas?
That is strange. The normal behaviour of pandas is read values, not formulas. Likely, the problem is in your excel files. Probably your formulas point to other files, or they return a value that pandas sees as nan.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve by setting explicit nan values in read_excel:
pd.read_excel(path, sheetname="Sheet1", na_values = [your na identifiers])
As for the first case, and as a workaround solution to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl
def df_from_excel(path):
app = xl.App(visible=False)
book = app.books.open(path)
book.save()
app.kill()
return pd.read_excel(path)
df = df_from_excel(path to your file)
If you want to keep those formulas in your excel file just save the file in a different location (book.save(different location)). Then you can get rid of the temporary files with shutil.
I had this problem and I resolve it by moving a graph below the first row I was reading. Looks like the position of the graphs may cause problems.
you can use xlrd to read the values.
first you should refresh your excel sheet you are also updating the values automatically with python. you can use the function below
file = myxl.xls
import xlrd
import win32com.client
import os
def refresh_file(file):
xlapp = win32com.client.DispatchEx("Excel.Application")
path = os.path.abspath(file)
wb = xlapp.Wordbooks.Open(path)
wb.RefreshAll()
xlapp.CalculateUntilAsyncqueriesDone()
wb.save()
xlapp.Quit()
after the file refresh, you can start reading the content.
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
row = worksheet.row(rowid)
for colid, cell in enumerate(row):
print(cell.value)
you can loop through however you need the data. and put conditions while you are reading the data. lot more flexibility

Delete excel row with Python

I'm doing some testing using python-excel modules. I can't seem to find a way to delete a row in an excel sheet using these modules and the internet hasn't offered up a solution. Is there a way to delete a row using one of the python-excel modules?
In my case, I want to open an excel sheet, read the first row, determine if it contains some valid data, if not, then delete it.
Any suggestions are welcome.
xlwt provides as the module name suggests Excel writer (creation rather than modification) funcionality.
xlrd on the other hand provides Excel reader funcionality.
If your source excel file is rather simple (no fancy graphs, pivot tables, etc.), you should proceed this way:
with xlrd module read the contents of the targeted excel file, and then with xlwt module create new excel file which contains the necessary rows.
If you, however are running this on windows platform , you might be able to manipulate Excel directly through Microsoft COM objects, see old book reference.
I was having the same issue but found a walk around:
Use a custom filter process (Reader>Filter1>Filter2>...>Writer) to generate a copy of the source excel file but with a blank column inserted at the front. Let's call this file augmented.xls.
Then, read augmented.xls into a xlrd.Workbook object, rb, using xlrd.open_workbook().
Use xlutils.copy.copy() to convert rb into a xlwt.Workbook object, wb.
Set the value of the first column of each of the to-be-deleted rows as "x" (or other values as a marker) in wb.
Save wb back to augmented.xls.
Use another custom filter process to generate a resulting excel file from augmented.xls by omitting those rows with "x" in the first column and shifting all columns one column left (equivalent to deleting the first column of markers).
Information and examples of defining a filter process can be found in http://www.simplistix.co.uk/presentations/python-excel.pdf
Hope this help in some way.
You can use the library openpyxl. When opening a file it is both for reading and for writing. Then, with a simple function you can achieve that:
from openpyxl import load_workbook
wb = load_workbook(filename)
ws = wb.active()
first_row = ws[1]
# Your code here using first_row
if first_row not valid:
ws.delete_rows(1, amount=1)

Categories