having issues formatting output to excel from dataframes, using xlsxwriter - python

I have a series of SQL database queries that I am writing to Excel using XlsxWriter/pandas.
I am using a simple global format for font type and size.
Each table is different, so the only thing I want to do is present a standard font and size.
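format = workbook.add_format()
format.set_font_size(9)
format.set_font_name('Calibri')  # set_font_name is a method, so call it rather than assign to it
for col_name in df1:
    # widest cell in the column, or the header if that is longer
    column_width = max(df1[col_name].astype(str).map(len).max(), len(col_name))
    col_idx = df1.columns.get_loc(col_name)
    if col_idx < 4:
        column_width = column_width + 1
    worksheet1.set_column(col_idx, col_idx, column_width, format)
writer.save()  # writer.close() in pandas 2.0 and later, where save() was removed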
This all works well until I encounter a date.
There may be multiple date fields, or no date field, in each Excel table.
All the fonts in the output are size 9, except in the date fields. All the date fields show up as size 11 and I don't know how to resolve the issue.
Also, the dates themselves show up in Excel as date-time, not date, even though they are defined in the database as a Date field. Converting them is also an issue. I can't seem to get rid of the time portion.
Any help would be greatly appreciated. I have spent way too much time on this.

Sounds to me like this problem.
If you have existing files, create a template Excel file in the correct format and use Python to just fill the cells. (This scenario is the accepted answer in the post.) You can also define a certain style in Excel once and apply it to columns.
Apparently, you have a somewhat more complicated scenario. The second answer proposes adjusting the data in pandas before writing it to Excel.
However, my personal guess is that it is rather a formatting problem in Excel, so your approach seems reasonable. How about specifying the format explicitly: format = workbook.add_format({'num_format': 'yyyy-mm-dd'})? (This would align with this post.) Try specifying the font size in that same format if setting the global font size does not work: 'font_size': 9 in XlsxWriter (the 'height': 9*20 form, where the height is scaled by 20 to use "points" as the unit, is xlwt's convention).

Related

Pandas read_excel: How to preserve cell format information for currency and percent

[ 10-07-2022 - For anyone stopping by with the same issue. After much searching, I have yet to find a way, that isn't convoluted and complicated, to accurately pull mixed type data from excel using Pandas/Python. My solution is to convert the files using unoconv on the command line, which preserves the formatting, then read into pandas from there. ]
I have to concatenate 1000s of individual excel workbooks with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate the data frame to a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with explicit indicators in the cell, Eg., '$' - other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into some edge cases.
Consider a case like the following:
In the actual spreadsheet, you see: $0.96
When read_excel siphons this in, it will be represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96
Is there a way to read excel files into a data frame for analysis and record what is visually represented in the cell, regardless of whether cell formatting was used or not?
I've tried using dtype="str", dtype="object" and have tried using both the default and openpyxl engines.
UPDATE
Taking the comments below into consideration, I'm rewriting with openpyxl.
from pathlib import Path
import openpyxl
import pandas as pd
def excel_concat(df_source):
    df_master = pd.DataFrame()
    for index, row in df_source.iterrows():
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
        wb = openpyxl.load_workbook(filename=excel_file)
        ws = wb.active
        # ws.values yields the stored cell values only, without number formats
        df_data = pd.DataFrame(ws.values)
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    return df_master
df_master1 = excel_concat(df_excel_files)
This appears to be nothing more than a "longcut" to just calling the openpyxl engine with pandas. What am I missing in order to capture the visible values in the excel files?
Looking at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html, I noticed the following:
dtype: Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. **If converters are specified, they will be applied INSTEAD of dtype conversion.**
converters: dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Do you think that might work for you?
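As far as I can tell, converters only receive the stored cell value, so on their own they cannot tell 96% from $0.96. With openpyxl, each cell also exposes its format string as cell.number_format, which can be read next to cell.value by iterating ws.iter_rows() instead of ws.values. The sketch below shows one rough way to use that pair. The helper name render_visible is hypothetical, and the rules are a heuristic that covers only the currency and percent cases from the question, not Excel's full number-format language.

```python
def render_visible(value, number_format):
    """Approximate the text Excel displays for a cell.

    value is the stored cell content and number_format is the cell's
    format string (for example "General", "0%" or '"$"#,##0.00').
    """
    # Notes and other non-numeric content are shown as they are stored.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return str(value)
    if "%" in number_format:
        # Percent formats display the stored fraction multiplied by 100.
        return f"{value * 100:g}%"
    if "$" in number_format:
        return f"${value:,.2f}"
    return str(value)


# The same stored number, told apart by its format string.
as_currency = render_visible(0.96, '"$"#,##0.00')
as_percent = render_visible(0.96, "0%")
as_note = render_visible("see notes", "General")
```

Applied per cell while looping over the workbooks, this would let the master frame hold "$0.96" or "96%" instead of a bare 0.96.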

How do I stop Excel from converting numbers to date?

I save my DataFrame as CSV and try to open it in Excel. The problem is that Excel converts some of my float data to date format. I use Excel 2016.
This is how my DataFrame looks in Excel.
Does anyone have an idea how to stop this?
You have to select the required column, press Ctrl + 1, and then select the correct format. Because you are saving the file as CSV, you have to repeat this action every time you open the file: CSV files don't store such information, and by default Excel reads everything in the General format. You can find more details here
If you use Excel to open a CSV file it will attempt to interpret each cell. Something that resembles a date will be formatted as a date. Excel has the same behaviour if you type or paste something that resembles a date into a cell formatted as General.
However, if you paste the same data into a cell that has already been formatted other than General it will no longer be re-interpreted.
Format a blank Excel sheet as you expect the data to appear. Open the CSV file in a text editor such as Notepad. Copy the data then paste it into the Excel sheet.
If you aren't sure how the data should appear, for example because you aren't sure about the number of columns, you can format all of the cells as Text. That will suppress interpretation but you can change the formatting afterwards.
Incidentally, I discovered a bug in Excel that relates to this. When you add a new row to the bottom of a table it inherits the formatting of the row above, however Excel does this in the wrong order. To see this, format a table column as Text. In the row below the last row of the table, formatted General, type '1/1/2022'. Excel misinterprets this as 44562. That is because it interpreted 1/1/2022 as a date then changed the formatting to Text to match the row above.
Consequently, when applying the initial formatting you should select at least as many rows as in your CSV file. The easiest way to achieve this is simply to format entire columns.
In your particular case you probably want to pre-format certain columns as Number.

XLSXWriter Format multiple rows

Trying to do something that should be simple. XLSXWriter has a function set_column that lets me format multiple columns at the same time:
worksheet.set_column('B:D', 30, align)
However, there is no such row function, as
worksheet.set_row(5,None, percentage)
operates on but one row at a time.
I've tried doing the following to no avail:
worksheet.conditional_format('C27:W27', {'format': percentage})
How can I simply set cells C27:W27 to be percentage format?
Unfortunately, a multi-row equivalent of worksheet.set_row(5, None, percentage) does not exist. I ended up scrapping this project.
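For anyone who wants to continue, a minimal sketch of two workarounds follows: calling set_row() in a loop to cover several rows, and writing the cells of a range such as C27:W27 with the format attached to each cell. The helper names set_rows_format and write_range and the sample values are hypothetical; the workbook is kept in memory only so that the example is self-contained.

```python
import io

import xlsxwriter


def set_rows_format(worksheet, first_row, last_row, cell_format):
    """Apply cell_format to every row from first_row to last_row (zero-based)."""
    for row in range(first_row, last_row + 1):
        worksheet.set_row(row, None, cell_format)


def write_range(worksheet, row, first_col, last_col, values, cell_format):
    """Write values into one row, attaching cell_format to each cell."""
    for col, value in zip(range(first_col, last_col + 1), values):
        worksheet.write(row, col, value, cell_format)


buffer = io.BytesIO()
workbook = xlsxwriter.Workbook(buffer, {"in_memory": True})
worksheet = workbook.add_worksheet()
percentage = workbook.add_format({"num_format": "0%"})

# Rows 5 to 10 in Excel's numbering are rows 4 to 9 zero-based.
set_rows_format(worksheet, 4, 9, percentage)

# C27:W27 is row 26, columns 2 to 22 zero-based (21 cells).
sample_values = [0.05 * i for i in range(21)]
write_range(worksheet, 26, 2, 22, sample_values, percentage)

workbook.close()
```

A row format covers the whole row, while the per-cell approach touches only the cells in the range.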

python xlsxwriter, set border dynamically

I am using python's xlsxwriter package to format the excel report that I am generating through a mysql query.
The problem is that the report generated by SQL returns its columns dynamically, so there is no way of knowing beforehand how many columns will be returned. I am trying to set a border only on the returned columns, but so far I am only able to hard-code the column range (A:DC). Can anyone help me with this? I am using the following code:
worksheet = writer.sheets['Sheet1']
formater = workbook.add_format({'border':1})
worksheet.set_column('A:DC',15,formater)
writer.save()
Set the range dynamically, based on the number of columns you receive:
data = [...]  # one entry per returned column
worksheet.set_column(0, len(data) - 1, 15, formater)  # last_col is zero-based and inclusive
set_column() docs for the reference.
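If the 'A:DC' style of range is preferred over numeric indices, the last column letter can be derived from the column count. The sketch below does this with plain Python; the helper names col_index_to_name and col_range are hypothetical. With a pandas DataFrame, the count would typically be len(df.columns).

```python
def col_index_to_name(idx):
    """Convert a zero-based column index to Excel letters (0 -> 'A', 26 -> 'AA')."""
    name = ""
    idx += 1  # switch to one-based, which makes the base-26 arithmetic simpler
    while idx > 0:
        idx, remainder = divmod(idx - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name


def col_range(n_columns):
    """Return the A1-style range that covers the first n_columns columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return f"A:{col_index_to_name(n_columns - 1)}"


# 107 returned columns reproduce the hard-coded range from the question.
hard_coded_equivalent = col_range(107)
```

The result can be passed straight to set_column, for example worksheet.set_column(col_range(len(data)), 15, formater).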

Is it possible to specify Date XFStyle without an exact format with xlwt?

I'm using Python-Excel xlwt to create a blank Excel spreadsheet for filling out in a spreadsheet. I would like to specify that a certain range of cells should be date formatted. I'm doing something like:
datestyle = xlwt.XFStyle()
datestyle.num_format_str = "YYYY-MM-DD"
ws.write(row, column, "", datestyle)
but that's a bit over-prescriptive. People may be pasting in data, and that means that if the format doesn't match exactly then there will be problems. Spreadsheets are generally good at spotting and understanding dates pasted in in various formats. I want the spreadsheet to be able to do this without the restriction of a specific input format.
I just want to say 'this cell is a date' and not impose a format. Is this doable?
You can't specify that a cell is a date and not impose a format, not with xlwt and not with anything else, including Excel itself. Two reasons:
(1) You can't specify that a cell is any type. It is whatever the user types or pastes in. You can format it as a date but they can type in text.
(2) "date" is not a data type in Excel. All Excel knows about is text, floating point numbers, booleans (TRUE/FALSE), errors (#DIV/0 etc), and "blank" (formatting but no data). A date cell is just a number cell with a date format.
A general answer to "Can I do X with xlwt?" questions: Firstly try doing X with Excel / OpenOffice Calc / Gnumeric. If you can't, then neither can xlwt.
The format you're prescribing defines how the date will be displayed, but it won't affect how Excel interprets entered dates. If your format is YYYY-MM-DD, the user can still enter 5/21/2008, and the string will be converted to its date value (39589) and then displayed in your specified format: "2008-05-21"
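The date value mentioned above is a day count, which can be checked with a few lines of Python. This is a sketch under the assumption that the workbook uses Excel's default 1900 date system; the names EXCEL_EPOCH, to_excel_serial and from_excel_serial are hypothetical.

```python
from datetime import date, timedelta

# Day zero that matches Excel's 1900 date system for dates from
# 1900-03-01 onward (earlier dates are off by one because Excel
# treats 1900 as a leap year).
EXCEL_EPOCH = date(1899, 12, 30)


def to_excel_serial(day):
    """Return the number Excel stores for a calendar date."""
    return (day - EXCEL_EPOCH).days


def from_excel_serial(serial):
    """Return the calendar date for a whole-number Excel serial."""
    return EXCEL_EPOCH + timedelta(days=serial)


serial = to_excel_serial(date(2008, 5, 21))
round_trip = from_excel_serial(serial)
```

The cell holds only this number; the YYYY-MM-DD format decides how it is shown.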
