TLDR: How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves?
My code combines two .xlsx sheets together to generate emails for new org users.
The first .xlsx contains a formula that concatenates the user's name and our domain, while the other .xlsx contains the queried list of new users. The combined workbook, 'users.xlsx', includes the desired information, but the emails in it are still stored as formulas, not values. When I try to read it with data_only, it doesn't seem to work at all: no emails come back from the newly created 'users.xlsx'.
Combining the sheets works well, but the final step is converting the .xlsx over to .csv.
Because the emails are technically generated through the concatenating formula, the conversion doesn't preserve the user's emails.
How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves? Is this possible? Can I force the third .xlsx to preserve values only and then do the conversion?
Things I've tried (while they all successfully convert to a .csv, the data within formulas is lost):
Legend:
combined_xlsx_2
# The .xlsx product after combining two xlsx (user info + emails)
# This product is 'users.xlsx' - I need it converted to a .csv
Code 1:
# Read the Excel file into a DataFrame
read_file = pd.read_excel(combined_xlsx_2)
# Write the DataFrame to a CSV file
filedir = combined_xlsx_2.replace("users_2.xlsx", "users.csv")
read_file.to_csv(filedir,
                 index=None,
                 header=True,
                 encoding='utf-8')
# Read the CSV file back into a DataFrame
df = pd.read_csv(filedir)
df
Code 2:
filename = combined_xlsx_2
filedir = filename.replace("/users.xlsx", "")
path_to_excel_files = glob.glob(filedir)
for excel in path_to_excel_files:
    out = excel.split('.')[0] + '.csv'
    df = pd.read_excel(excel)
    df.to_csv(out)
Code 3:
wb = xlrd.open_workbook(combined_xlsx_2)
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open(combined_xlsx_2.replace('.xlsx', '.csv'), 'w')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in range(sh.nrows):
    wr.writerow(sh.row_values(rownum))
your_csv_file.close()
Thank you for your time and assistance!
UPDATE 1:
I was able to accomplish this using 'convert-api'
https://www.convertapi.com/xlsx-to-csv#snippet=python
While not what I had in mind, it will at least get me by. Still hoping there's a better solution for this. Just wanted to share this just in case anyone else had a similar question.
Related
I'm using psycopg2 to query a database and return a table of data. I need to then write that data to a .xlsx file.
I had it writing to a .csv really nicely using:
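from csv import writer

# newline="" stops the csv module from emitting blank rows on Windows
with open("file_name.csv", "w", newline="") as file:
    csv_writer = writer(file)
    csv_writer.writerow(headers)
    csv_writer.writerows(data)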
This works fine. The only issue is that I now need to open the .csv and save it as a new .xlsx, which is a step I want to cut out.
I'm trying to use pandas:
df = pandas.DataFrame(data, columns=headers)
df.to_excel("file_name.xlsx")
But all numbers are being stored as text, so I now need to go back in and refresh the cells for Excel to realise each one is an integer or float.
I also tried openpyxl. This works better, but it still stores the date column as text, so I still need to go in and refresh the cells for Excel to recognise them as dates.
I thought it might be an issue with how psycopg2 pulls the data, but it's not an issue for .csv, so why is it a problem for .xlsx? This is probably just my lack of understanding of the difference between the two file types. Does anyone have a solution for saving as a .xlsx while retaining all the correct formatting?
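After creating the DataFrame you can cast each column to the type Excel should see. Note that the DataFrame constructor has no "converters" argument (that option belongs to the read functions such as read_excel), so use astype instead:
dtypes = {
    'name': str,
    'ages': int,
    'score': float
}
df = pandas.DataFrame(data, columns=headers).astype(dtypes)
df.to_excel("file_name.xlsx")
Where the keys of the dtypes dict are the column names in the DataFrame.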
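For numbers and dates that arrive as text, here is a minimal sketch of converting the columns before writing. The column names and sample rows are hypothetical stand-ins for whatever the query returns, and `pd.to_numeric` / `pd.to_datetime` are used because they infer the target type from the strings:

```python
import pandas as pd

# Hypothetical rows, as a query might return them when every value is text.
headers = ["name", "ages", "score", "joined"]
data = [
    ("Ann", "31", "4.5", "2021-03-01"),
    ("Bob", "27", "3.9", "2021-04-15"),
]

df = pd.DataFrame(data, columns=headers)

# Convert text columns to real numeric and datetime dtypes so that a later
# df.to_excel(...) writes numbers and dates instead of strings.
df["ages"] = pd.to_numeric(df["ages"])
df["score"] = pd.to_numeric(df["score"])
df["joined"] = pd.to_datetime(df["joined"])

print(df.dtypes)
```

Once the dtypes look right, `df.to_excel("file_name.xlsx", index=False)` should write typed cells.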
I've tried a few methods, including pandas:
df = pd.read_excel('file.xlsx')
df.to_csv('file.csv')
But every time I convert my xlsx file over to csv format, I lose all data within columns that include a formula. I have a formula that concatenates values from two other cells + '#domain' to create user emails, but this entire column returns blank in the csv product.
The formula is basically this:
=CONCATENATE(B2,".",E2,"#domain")
The conversion is part of a larger code workflow, but it won't work if this column is left blank. The only thing I've tried that worked was this API, but I'd rather not pay a subscription if this can be done locally on the machine.
Any ideas? I'll try whatever you throw at me - bear in mind I'm new to this, but I will do my best!
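You can open the Excel file with the openpyxl library in data-only mode. Instead of the raw formulas, you get the values Excel last computed and cached in the file, just as you see them in Excel itself. Note that openpyxl does not evaluate formulas itself: if the workbook was never opened and saved by Excel (for example, because it was written by pandas or openpyxl), there are no cached values and the formula cells come back empty.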
import openpyxl
wb = openpyxl.load_workbook(filename, data_only=True)
Be careful when working with your original file: if you load it in data-only mode and then save it with openpyxl, all your formulas will be lost. This happened to me once and it was horrible, so I recommend working on a copy of your file.
Now that you have the workbook loaded with values only, you can use the built-in csv module to generate a proper CSV file (idea from this post: How to save an Excel worksheet as CSV):
import csv
sheet = wb.active  # was .get_active_sheet()
with open('test.csv', 'w', newline="") as f:
    c = csv.writer(f)
    for r in sheet.iter_rows():  # generator; was sh.rows
        c.writerow([cell.value for cell in r])
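If the workbook was produced by code and never saved by Excel, another option is to skip the formula and build the email column directly in pandas before writing the CSV. This is a minimal sketch that mirrors the CONCATENATE formula from the question. The column names `first` and `last`, the sample rows, and the helper name `add_email_column` are all hypothetical:

```python
import pandas as pd


def add_email_column(df, first_col, last_col, suffix, out_col="email"):
    """Return a copy of df with out_col = first + "." + last + suffix."""
    out = df.copy()
    out[out_col] = (
        out[first_col].astype(str) + "." + out[last_col].astype(str) + suffix
    )
    return out


# Hypothetical user list standing in for the queried new users.
users = pd.DataFrame({"first": ["ada", "alan"], "last": ["lovelace", "turing"]})

result = add_email_column(users, "first", "last", "#domain")

# The column now holds plain values, so the CSV keeps them.
csv_text = result.to_csv(index=False)
print(csv_text)
```

Because the values are computed in Python, the CSV no longer depends on Excel having cached a formula result.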
I am having much trouble trying to read in a large excel file (.xlsx), and write some of its tabs/sheets to a smaller excel file.
In one class, I return a dict of dataframes. Each key is the name of the sheet/tab that the dataframe came from (a string). Each value is the actual DataFrame, with all of its original columns. In this class, I extract certain dataframes from the original Excel file.
I am able to print out my key:value pairs after extracting the dataframes of my choice, and it all looks fine. However, I believe that my real problem is writing the actual data to 1 excel file. I only get the first dataframe, without the sheet name that it came from (it becomes the default 'Sheet1'), and nothing else.
Here is the code that writes my dict to an excel file:
def frames_to_excel(self, df_dict, path):
    """Write dictionary of dataframes to separate sheets, within
    1 file."""
    # The context manager saves and closes the file on exit;
    # writer.save() was removed in pandas 2.0.
    with pd.ExcelWriter(path, engine='xlsxwriter') as writer:
        for tab_name, dframe in df_dict.items():
            dframe.to_excel(writer, sheet_name=tab_name)
- "path" is the selected output path to write the whole dict to, as an xlsx file.
- "df_dict" is the dict of dataframes.
I am very sorry for the confusion. My bug was not at all in the code I posted, or any of the classes that parse the data from the original excel file. The problem was this line of code:
excel_path = re.sub(r"(?i)original|_original", "_custom", os.path.basename(excel_path))
By calling os.path.basename, I was keeping only the file name instead of the full path:
writer = pd.ExcelWriter(excel_path, engine='xlsxwriter')
Therefore, I was not writing the data to the full path I expected, and I was looking at old output from my program, from about 5 days ago. Thanks for everyone's help.
The fix (use the proper full path that you expect):
excel_path = re.sub(r"(?i)original|_original", "_custom", excel_path)
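A small sketch of a variant that keeps the full path but applies the substitution only to the file name, so a directory that happens to contain "original" is left alone. The helper name `custom_output_path` and the sample path are hypothetical; the pattern is the one from the post:

```python
import os
import re


def custom_output_path(excel_path):
    """Rename only the file-name part of excel_path, keeping its directory."""
    directory, filename = os.path.split(excel_path)
    new_name = re.sub(r"(?i)original|_original", "_custom", filename)
    return os.path.join(directory, new_name)


# Hypothetical input path; the directory name also contains "original".
source = os.path.join("reports", "original_data", "sales_original.xlsx")
print(custom_output_path(source))
```

Running `re.sub` on the whole path, as in the fix above, would also rewrite the directory name in this example.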
I want to write multiple df of varying sizes to Excel as my code runs.
Some tables will contain source data, and other tables will contain Excel formulas that operate on that source data.
Rather than tracking the range of cells that I wrote the source data to, I want the formula df to contain an Excel reference to the source data df.
This can be done with Excel's Names or with Excel's Table features.
For example in my formula df I can have =INDEX(my_Defined_Name_source_data, 4,3) * 2 and the Excel Name my_Defined_Name_source_data is all I need to index my source data.
Openpyxl details writing Tables here https://openpyxl.readthedocs.io/en/stable/worksheet_tables.html?highlight=tables
Tables don't support the merged cells that df.to_excel creates for a MultiIndex.
So I'm looking at Defined Names instead. There's almost no documentation for writing Defined Names in openpyxl using
wb.defined_names.append()
This is what I've found https://openpyxl.readthedocs.io/en/stable/api/openpyxl.workbook.defined_name.html?highlight=definednames
What I'm asking for help with: How to write a DataFrame to Excel and also give it an Excel Defined Name. Documentation and online examples are almost non existent.
Also gratefully accepting suggestions for alternative ideas since I seem to be accessing something almost nobody else uses.
The "xlsxwriter" library allows you to create an Excel Data Table, so I wrote the following function to take a DataFrame, write it to Excel, and then transform the data to a Data Table.
def dataframe_to_excel_table(df, xl_file, xl_tablename, xl_sheet='Sheet1'):
    """
    Pass a dataframe, filename, name of table and Excel sheet name.
    Save an excel file of the df, formatted as a named Excel 'Data table'
    * Requires "xlsxwriter" library ($ pip install XlsxWriter)
    :param df: a Pandas dataframe object
    :param xl_file: File name of Excel file to create
    :param xl_sheet: String containing sheet/tab name
    :param xl_tablename: Data table name in the excel file
    :return: Nothing / New Excel file
    """
    # Excel doesn't like multi-indexed df's. Convert to 1 value per column/row
    # See https://stackoverflow.com/questions/14507794
    df.reset_index(inplace=True)  # Expand multiindex (modifies the caller's df)
    # Write dataframe to Excel
    writer = pd.ExcelWriter(path=xl_file,
                            engine='xlsxwriter',
                            datetime_format='yyyy mm dd hh:mm:ss')
    df.to_excel(writer, index=False, sheet_name=xl_sheet)
    # Get dimensions of data to size table
    num_rows, num_cols = df.shape
    # make list of dictionaries of form [{'header' : col_name},...]
    # to pass so table doesn't overwrite column header names
    # https://xlsxwriter.readthedocs.io/example_tables.html#ex-tables
    dataframes_cols = df.columns.tolist()
    col_list = [{'header': col} for col in dataframes_cols]
    # Convert data in Excel file to an Excel data table
    worksheet = writer.sheets[xl_sheet]
    worksheet.add_table(0, 0,  # begin in Cell 'A1'
                        num_rows, num_cols - 1,
                        {'name': xl_tablename,
                         'columns': col_list})
    # close() saves the file; writer.save() was removed in pandas 2.0
    writer.close()
I fixed this by simply switching from OpenPyXL to XLSXWriter
https://xlsxwriter.readthedocs.io/example_defined_name.html?highlight=names
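A rough sketch of how a defined name could be attached to a DataFrame's data range with XlsxWriter. The helpers `column_letter`, `data_range` and `add_defined_name` are hypothetical. The only library call relied on is `workbook.define_name(name, formula)`. The sketch assumes the DataFrame was written with `index=False` starting at cell A1, and that the sheet name has no spaces:

```python
def column_letter(index):
    """Convert a 0-based column index to Excel letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def data_range(sheet_name, n_rows, n_cols, first_row=1, first_col=0):
    """Absolute A1-style range covering n_rows x n_cols data cells.

    first_row=1 (0-based) skips the header row that to_excel writes.
    Assumes sheet_name needs no quoting (no spaces or punctuation).
    """
    top_left = f"${column_letter(first_col)}${first_row + 1}"
    bottom_right = f"${column_letter(first_col + n_cols - 1)}${first_row + n_rows}"
    return f"={sheet_name}!{top_left}:{bottom_right}"


def add_defined_name(workbook, name, sheet_name, n_rows, n_cols):
    """Register name on an XlsxWriter workbook for the data block."""
    workbook.define_name(name, data_range(sheet_name, n_rows, n_cols))


# A 3-row, 2-column DataFrame written at A1 occupies A2:B4 below its header.
print(data_range("Sheet1", 3, 2))
```

With pandas, the XlsxWriter workbook is available as `writer.book` when the writer is created with `engine='xlsxwriter'`, so after `df.to_excel(writer, index=False, sheet_name='Sheet1')` one could call `add_defined_name(writer.book, 'my_Defined_Name_source_data', 'Sheet1', *df.shape)` before closing the writer.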
I have an excel sheet and I am reading the excel sheet using pandas in python.
Now I want to read the Excel file conditionally on a column: if the column has a value in a given row, skip that row; if the column is empty, read the row and store its values in a list.
Here is a screenshot (image: Excel Example).
In the image above, when UniqueIdentifier is "yes" the row should not be read, but when it is empty the row should be read.
How can I do that in Python, and how do I get the index so that, after I have processed a row, I can write back to its blank UniqueIdentifier column to mark the row as read?
This is possible for CSV files. There you could do
import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=100000)
# keep only the rows whose UniqueIdentifier is still empty
df = pd.concat([chunk[chunk['UniqueIdentifier'].isna()] for chunk in iter_csv])
But pd.read_excel does not offer to return an iterator object. Maybe some other Excel readers can, but I don't know which ones. Nevertheless, you could export your Excel file as CSV and use the solution for CSV files.
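For a sheet that fits in memory, the same idea can be sketched on an ordinary DataFrame: select the rows whose UniqueIdentifier is empty, remember their index, and mark them afterwards. The `value` column and the sample rows are hypothetical stand-ins for what `pd.read_excel` would return:

```python
import pandas as pd

# Hypothetical sheet contents; None stands for an empty cell.
df = pd.DataFrame({
    "value": ["a", "b", "c", "d"],
    "UniqueIdentifier": ["yes", None, "yes", None],
})

# Boolean mask of rows that have not been read yet.
unread = df["UniqueIdentifier"].isna()

# Values to process, plus the index labels needed to write back later.
rows_to_process = df.loc[unread, "value"].tolist()
unread_index = df.index[unread].tolist()

# ... process rows_to_process here ...

# Mark the processed rows as read.
df.loc[unread_index, "UniqueIdentifier"] = "yes"
print(rows_to_process, unread_index)
```

Writing `df` back with `df.to_excel(...)` would then persist the updated column.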