Excel sheet to Rmarkdown - python

I have an Excel workbook with n worksheets. Each sheet contains a different number of tables, and the tables have different lengths. Is there a way to convert them all into R Markdown tables in one go? The only method I currently know is to copy and paste each table into an online converter, which works but is static. Is there a way, in R or Python, to read the Excel file and have its tables converted into R Markdown tables, so that I don't have to copy and paste each table?

You should consider the xlsx package and xlsx::read.xlsx. According to Geza in this answer, you should first collect the workbook's sheet names:
library(xlsx)
wb <- loadWorkbook("path-to-your_xlsx/file.xlsx")
sheets <- getSheets(wb)
namesl <- names(sheets)
Then you can create a data.frame for each sheet, e.g. with read.xlsx or read.xlsx2, as in the following code:
for (i in seq_along(namesl)) {
  # create an object in your environment, named after the sheet, with assign()
  assign(namesl[i],
         xlsx::read.xlsx("path-of-your-workbook.xlsx",  # read one workbook sheet
                         sheetName = namesl[i], as.data.frame = TRUE, header = TRUE))
  # adjust the import options as you like, and check that each sheet imported correctly
}
This creates one data.frame per sheet in the workbook (each data.frame has the same name as its sheet) and fills it with the content of that sheet.
Have an excellent day
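For the Python side of the question, here is a minimal sketch rather than a definitive implementation. It assumes openpyxl is installed and that the tables on a sheet are stacked vertically and separated by at least one completely empty row (the question does not say how the tables are laid out, so this is a guess). The helper names split_tables, rows_to_markdown and workbook_to_markdown are made up for illustration. The output is pipe-table Markdown, which R Markdown documents can render.

```python
def split_tables(rows):
    """Split a sheet's rows into tables, using fully empty rows as separators."""
    tables, current = [], []
    for row in rows:
        if all(cell is None or cell == "" for cell in row):
            if current:
                tables.append(current)
                current = []
        else:
            current.append(list(row))
    if current:
        tables.append(current)
    return tables


def rows_to_markdown(rows):
    """Render rows (first row = header) as a Markdown pipe table."""
    def fmt(cell):
        return "" if cell is None else str(cell).replace("|", "\\|")

    header, *body = rows
    lines = ["| " + " | ".join(fmt(c) for c in header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(fmt(c) for c in row) + " |" for row in body]
    return "\n".join(lines)


def workbook_to_markdown(path):
    """Return one Markdown string with every table of every sheet."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    wb = load_workbook(path, data_only=True)  # cached values, not formula text
    parts = []
    for ws in wb.worksheets:
        parts.append(f"## {ws.title}")
        for table in split_tables(ws.iter_rows(values_only=True)):
            parts.append(rows_to_markdown(table))
    return "\n\n".join(parts)
```

The returned string can be written to a file and included in the .Rmd document. Tables placed side by side on the same rows, or tables that start in a column other than A, are not handled by this sketch and would need extra column trimming.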

Related

openpyxl: How to add rows to an existing table (in an existing xlsx file) that doesn't start at 'A1'

I have an Excel 'file.xlsx' file:
with a sheet that has a named excel table, which is somewhere in the middle of the sheet, say C3.
a bunch of charts etc that use this table as source.
I have a pyspark DataFrame that I want to write to this table, so all the charts are updated and I have an Excel report.
I know how to do this using loops to set Cell.value one cell at a time. I'm hoping to find something less tedious.
Unsolved problems:
Update an existing table in an existing xlsx file.
Not lose/delete everything else in the xlsx file being updated.
(preferably) avoid iterating over the input tabular data and update excel cell-by-cell.
Things that didn't work for me:
pyspark.pandas.DataFrame.to_excel() problem: This overwrites the whole 'file.xlsx' and we lose all other sheets / charts etc.
df.toPandas().to_excel('file.xlsx', sheet_name=sheet_name, engine='openpyxl', index=False,
                       startcol=3, startrow=3)
openpyxl.utils.dataframe.dataframe_to_rows() problem: Starts pasting data at A1. Don't know how to update activeCell or current_row so append() starts from B3 instead of A1.
ws: Worksheet = openpyxl.open('file.xlsx').create_sheet(title=sheet_name)
for r in dataframe_to_rows(df.toPandas(), index=False, header=True):
    ws.append(r)
My current solution iter_rows() / iter_cols() / cell_range() / Worksheet.cell() problem: Loops over cell-by-cell.
I've read these and some more:
Appending data to existing tables in openpyxl
Manipulate existing excel table using openpyxl
Writing to row using openpyxl?
Write to an existing excel file using Openpyxl starting in existing sheet starting at a specific column and row
Openpyxl/Pandas - Convert CSV to XLSX
openpyxl convert CSV to EXCEL
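One possible direction, as a sketch under stated assumptions and not a confirmed solution: load the existing workbook with openpyxl, overwrite the data rows below the table's header in place, and then resize the table's ref so that anything pointing at the table follows the new size. The helper names resized_ref and write_rows_to_table are hypothetical. It assumes the new rows have the same columns as the table. The write is still a loop over cells internally, since as far as I know openpyxl has no bulk range write, but the loop is hidden in one helper. Please test on a copy first, because openpyxl may not preserve every chart or shape on a load and save round trip.

```python
import re


def resized_ref(ref, n_data_rows):
    """Return `ref` with the same columns and header row, resized to n_data_rows."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref)
    if match is None:
        raise ValueError(f"not a plain A1 range: {ref!r}")
    first_col, first_row, last_col, _ = match.groups()
    # Keep at least one data row below the header.
    last_row = int(first_row) + max(n_data_rows, 1)
    return f"{first_col}{first_row}:{last_col}{last_row}"


def write_rows_to_table(xlsx_path, sheet_name, table_name, rows):
    """Overwrite the data rows of an existing named table and resize it."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl
    from openpyxl.utils import range_boundaries

    rows = list(rows)
    wb = load_workbook(xlsx_path)
    ws = wb[sheet_name]
    table = ws.tables[table_name]
    min_col, min_row, max_col, max_row = range_boundaries(table.ref)
    # Blank the old data rows so leftovers do not linger if the new data is shorter.
    for old_row in ws.iter_rows(min_row=min_row + 1, max_row=max_row,
                                min_col=min_col, max_col=max_col):
        for cell in old_row:
            cell.value = None
    # Write the new data directly below the header row.
    for r, row in enumerate(rows, start=min_row + 1):
        for c, value in enumerate(row, start=min_col):
            ws.cell(row=r, column=c, value=value)
    table.ref = resized_ref(table.ref, len(rows))
    if table.autoFilter is not None:
        table.autoFilter.ref = table.ref  # keep the filter range in sync
    wb.save(xlsx_path)
```

With the DataFrame from the question, a call could look like write_rows_to_table('file.xlsx', sheet_name, 'MyTable', df.toPandas().values.tolist()), where 'MyTable' stands in for the real table name.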

Python: How to save excel workbook without ruining dynamic spill/array formulas

Short description of the problem:
I am currently accessing an Excel workbook from Python with openpyxl.
I have some dynamic spill formulas in sheet1, like filter(), byrow() and unique().
With the python script, I am doing some operations in sheet2, but I am not touching sheet1 (where the dynamic spill formulas are located).
When using workbook.save() method in Python, I experience that the dynamic formulas in sheet1 are ruined and static, not having the dynamic functionality they had before interacting with python.
What can I do? Use a parameter in .save()? Use another method?
Detailed description of problem (with pictures):
I have a workbook called Original, with the following three sheets:
nums
dynamic
dump
In "nums" I have a cell for ID (AA), and a column with some numerical values (picture1).
In "dynamic" I have some dynamic formulas like byrow() and filter() that update automatically with the values in the ID cell and the Values column of "nums" (picture2).
The sheet "dump" is empty for now.
I have a second workbook called Some_data, which has one sheet with a 3-column dataframe (picture3).
I am dumping the 3-column dataframe of Some_data into the empty "dump" sheet of Original with a Python script, and then using the workbook.save() method to save the new workbook.
The code is here:
import pandas as pd
from openpyxl import load_workbook

Some_data = "path/to/Some_data.xlsx"  # placeholder: file path of the workbook
Original = "path/to/Original.xlsx"  # placeholder: file path of the workbook
df = pd.read_excel(Some_data, engine="openpyxl")
wb = load_workbook(filename=Original)
ws = wb["dump"]
rownr = 2  # start writing below the header row
for index, row in df.iterrows():
    ws["B" + str(rownr)] = row["col1"]
    ws["C" + str(rownr)] = row["col2"]
    ws["D" + str(rownr)] = row["col3"]
    rownr += 1
wb.save("path/to/new_workbook.xlsx")  # placeholder: file path of the new workbook
The "dump" sheet of the newly saved workbook has now been populated.
The problem is that the dynamic formulas in the sheet "dynamic" have been ruined, although the Python script does not interact with either of the sheets "nums" or "dynamic".
First of all, the dynamic array formulas (like filter) now have curly brackets around them (picture4), and they are not dynamic anymore (there is no blue line around the array when it is selected, and they do not update automatically; picture5).
I need help with what to do. I want to save the Excel file without ruining the dynamic array formulas.
Thank you for your help, in advance.
Frode

Merge particular sheets from multiple workbooks

I have a folder with 8 workbooks, each with multiple sheets. I want to rearrange the columns of the sheet named RAW in every workbook and combine all the RAW sheets into one sheet called Final_Raw.
I need macro code to achieve this. Can this also be automated using Python?
It is possible to do this in VBA. You need to collect the data from the sheets, which means you declare and open each workbook like this:
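Sub getdata()
    Dim strLocation As String
    Dim objWorkbookOne As Workbook
    Dim wsData As Worksheet
    ' Give each counter its own type: "Dim a, b As Integer" would make a a Variant.
    ' Long also avoids an overflow on sheets with more than 32,767 rows.
    Dim intFR As Long, intLR As Long
    strLocation = "C:\Users\fredd\Documents\"
    Set objWorkbookOne = Workbooks.Open(strLocation & "14082022194559_download_MEDEWERKER.xlsx")
    ' Destination sheet in the workbook that holds this macro
    Set wsData = ThisWorkbook.Sheets(1)
    wsData.Activate
    intFR = 1
    ' Last used row in column A of the source sheet
    intLR = objWorkbookOne.Worksheets("MEDEWERKER").Cells(Rows.Count, 1).End(xlUp).Row
    ' Copy column A row by row into the destination sheet
    For intFR = 1 To intLR
        wsData.Cells(intFR, 1) = objWorkbookOne.Worksheets("MEDEWERKER").Cells(intFR, 1)
    Next intFR
End Sub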
In the code above we get data from a file named 14082022194559_download_MEDEWERKER.xlsx in the folder C:\Users\fredd\Documents\. I put the location in a variable so it is easy to change if necessary. The file is opened as objWorkbookOne (of course you can do the same for all eight workbooks).
Once the workbook is open, we activate the sheet into which we want to 'paste' the data. Next, the first row (intFR) and last row (intLR) of the source sheet in objWorkbookOne are determined. The For loop then 'copies' the data into the master file.
I don't know exactly how your master file and the other files are structured, so I have to make assumptions. The code above copies one column to another column, but the same is possible with ranges.
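As for the Python part of the question, a minimal sketch is below. It assumes openpyxl is installed, that every RAW sheet has its column names in the first row, and that you know the column order you want. The function names, the folder path and the column names in the usage note are hypothetical.

```python
from pathlib import Path


def reorder_columns(rows, wanted):
    """Reorder rows (first row = header) so the columns follow `wanted`."""
    header = list(rows[0])
    idx = [header.index(name) for name in wanted]  # ValueError if a column is missing
    return [[row[i] for i in idx] for row in rows]


def merge_raw_sheets(folder, wanted, out_path, sheet_name="RAW"):
    """Stack the RAW sheet of every workbook in `folder` into one Final_Raw sheet."""
    from openpyxl import Workbook, load_workbook  # third-party: pip install openpyxl

    out_wb = Workbook()
    out_ws = out_wb.active
    out_ws.title = "Final_Raw"
    out_ws.append(list(wanted))  # write the header once
    for path in sorted(Path(folder).glob("*.xlsx")):
        wb = load_workbook(path, data_only=True)  # values, not formulas
        rows = [list(r) for r in wb[sheet_name].iter_rows(values_only=True)]
        if not rows:
            continue
        for row in reorder_columns(rows, wanted)[1:]:  # skip each file's own header
            out_ws.append(row)
    out_wb.save(out_path)
```

A call could look like merge_raw_sheets("C:/data/workbooks", ["Name", "Date", "Amount"], "C:/data/Final.xlsx"). Saving the result outside the input folder keeps it from being picked up as a ninth workbook on the next run.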

Convert XLSX to CSV in Python: Keep Values, not Formulas

TLDR: How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves?
My code combines two .xlsx sheets together to generate emails for new org users.
The first .xlsx contains a formula that concatenates the user's name and our domain, while the other .xlsx contains the queried list of new users. When combined, the newly generated .xlsx, titled 'users.xlsx' includes the desired information - but the emails generated are done so using the formula, still - not values. If asked to read data_only via pandas, it doesn't seem to work at all and no emails are generated on this newly created 'users' xlsx sheet.
This is all fine and works well, but the final step is converting the .xlsx over to .csv
Because the emails are technically generated through the concatenating formula, the conversion doesn't preserve the user's emails.
How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves? Is this possible? Can I force the third .xlsx to preserve values only and then do the conversion?
Things I've tried (while they all successfully convert into a .csv, the data within formulas is lost):
Legend:
combined_xlsx_2
# The .xlsx product after combining two xlsx (user info + emails)
# This product is 'users.xlsx' - I need it converted to a .csv
Code 1:
import pandas as pd

# Read and store the content of the Excel file
read_file = pd.read_excel(combined_xlsx_2)
# Write the dataframe object into a csv file
filedir = combined_xlsx_2.replace("users_2.xlsx", "users.csv")
read_file.to_csv(filedir,
                 index=None,
                 header=True,
                 encoding='utf-8')
# Read the csv file back into a dataframe object
df = pd.DataFrame(pd.read_csv(filedir))
df
Code 2:
import glob
import pandas as pd

filename = combined_xlsx_2
filedir = filename.replace("/users.xlsx", "")
path_to_excel_files = glob.glob(filedir)
for excel in path_to_excel_files:
    out = excel.split('.')[0] + '.csv'
    df = pd.read_excel(excel)
    df.to_csv(out)
Code 3:
import csv
import xlrd

wb = xlrd.open_workbook(combined_xlsx_2)
sh = wb.sheet_by_name('Sheet1')
# newline='' stops the csv module from writing blank lines on Windows
with open(combined_xlsx_2.replace('.xlsx', '.csv'), 'w', newline='') as your_csv_file:
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))
Thank you for your time and assistance!
UPDATE 1:
I was able to accomplish this using 'convert-api'
https://www.convertapi.com/xlsx-to-csv#snippet=python
While not what I had in mind, it will at least get me by. Still hoping there's a better solution for this. Just wanted to share this just in case anyone else had a similar question.
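Another option that may be worth trying is openpyxl's data_only mode, sketched below with hypothetical helper names write_csv and xlsx_values_to_csv. With data_only=True, openpyxl returns the result Excel last calculated for each formula cell and not the formula text. The important caveat is that those cached results exist only if the workbook was last saved by Excel itself. A workbook written by pandas or openpyxl contains the formulas with no cached results, so the formula cells come back empty, which is possibly why the emails went missing in the attempts above. In that case, opening users.xlsx in Excel and saving it once before converting should populate the values.

```python
import csv


def write_csv(rows, handle):
    """Write rows to an open text handle, turning None into an empty field."""
    writer = csv.writer(handle)
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])


def xlsx_values_to_csv(xlsx_path, csv_path, sheet_name=None):
    """Export one sheet to CSV using the values Excel last calculated."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    # data_only=True yields the cached result of each formula, not the formula text.
    wb = load_workbook(xlsx_path, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        write_csv(ws.iter_rows(values_only=True), handle)
```

For the file in the question, the call would be along the lines of xlsx_values_to_csv(combined_xlsx_2, combined_xlsx_2.replace(".xlsx", ".csv")).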

Pandas dataframe to Excel with Defined Name range

I want to write multiple df of varying sizes to Excel as my code runs.
Some tables will contain source data, and other tables will contain Excel formulas that operate on that source data.
Rather than tracking the range of cells that I wrote the source data to, I want the formula df to contain an Excel reference to the source data df.
This can be done with Excel's Names or with Excel's Table features.
For example in my formula df I can have =INDEX(my_Defined_Name_source_data, 4,3) * 2 and the Excel Name my_Defined_Name_source_data is all I need to index my source data.
Openpyxl details writing Tables here https://openpyxl.readthedocs.io/en/stable/worksheet_tables.html?highlight=tables
Tables don't support the merged cells that a MultiIndex df.to_excel will create.
So I'm looking at Defined Names instead. There's almost no documentation for writing Defined Names in openpyxl using
wb.defined_names.append()
This is what I've found https://openpyxl.readthedocs.io/en/stable/api/openpyxl.workbook.defined_name.html?highlight=definednames
What I'm asking for help with: how to write a DataFrame to Excel and also give it an Excel Defined Name. Documentation and online examples are almost nonexistent.
Also gratefully accepting suggestions for alternative ideas since I seem to be accessing something almost nobody else uses.
The "xlsxwriter" library allows you to create an Excel Data Table, so I wrote the following function to take a DataFrame, write it to Excel, and then transform the data to a Data Table.
import pandas as pd

def dataframe_to_excel_table(df, xl_file, xl_tablename, xl_sheet='Sheet1'):
    """
    Pass a dataframe, filename, name of table and Excel sheet name.
    Save an Excel file of the df, formatted as a named Excel 'Data table'.
    * Requires the "xlsxwriter" library ($ pip install XlsxWriter)
    :param df: a pandas DataFrame object
    :param xl_file: File name of Excel file to create
    :param xl_sheet: String containing sheet/tab name
    :param xl_tablename: Data table name in the Excel file
    :return: None (a new Excel file is written to xl_file)
    """
    # Excel doesn't like multi-indexed dfs. Convert to 1 value per column/row
    # See https://stackoverflow.com/questions/14507794
    df = df.reset_index()  # Expand the (multi)index without mutating the caller's df
    # Write dataframe to Excel
    writer = pd.ExcelWriter(path=xl_file,
                            engine='xlsxwriter',
                            datetime_format='yyyy mm dd hh:mm:ss')
    df.to_excel(writer, index=False, sheet_name=xl_sheet)
    # Get dimensions of data to size table
    num_rows, num_cols = df.shape
    # Make a list of dictionaries of the form [{'header': col_name}, ...]
    # to pass so the table doesn't overwrite the column header names
    # https://xlsxwriter.readthedocs.io/example_tables.html#ex-tables
    dataframes_cols = df.columns.tolist()
    col_list = [{'header': col} for col in dataframes_cols]
    # Convert data in Excel file to an Excel data table
    worksheet = writer.sheets[xl_sheet]
    worksheet.add_table(0, 0,  # begin in cell 'A1'
                        num_rows, num_cols - 1,
                        {'name': xl_tablename,
                         'columns': col_list})
    writer.close()  # writer.save() was removed in pandas 2.0; close() writes the file
I fixed this by simply switching from OpenPyXL to XlsxWriter:
https://xlsxwriter.readthedocs.io/example_defined_name.html?highlight=names
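To make the XlsxWriter route concrete, here is a minimal sketch, assuming pandas and XlsxWriter are installed and that the DataFrame has at least one row. It writes the DataFrame with to_excel and then registers a workbook-level Defined Name over the data cells through XlsxWriter's Workbook.define_name(). The helper names column_letter, absolute_range and dataframe_to_named_range are made up for illustration, and the choice to exclude the header row from the named range is my own assumption.

```python
def column_letter(index):
    """Convert a zero-based column index to an Excel column letter (0 -> A)."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def absolute_range(sheet, first_row, first_col, last_row, last_col):
    """Build a formula such as ='Sheet1'!$A$1:$C$5 from zero-based indices."""
    quoted = sheet.replace("'", "''")  # escape quotes inside the sheet name
    return (f"='{quoted}'!${column_letter(first_col)}${first_row + 1}"
            f":${column_letter(last_col)}${last_row + 1}")


def dataframe_to_named_range(df, xl_file, name, sheet="Sheet1", startrow=0, startcol=0):
    """Write df (without its index) and define `name` over the data below the header."""
    import pandas as pd  # third-party: pip install pandas XlsxWriter

    n_rows, n_cols = df.shape
    with pd.ExcelWriter(xl_file, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name=sheet, index=False,
                    startrow=startrow, startcol=startcol)
        # The data starts one row below the header written by to_excel.
        ref = absolute_range(sheet, startrow + 1, startcol,
                             startrow + n_rows, startcol + n_cols - 1)
        writer.book.define_name(name, ref)
```

After dataframe_to_named_range(df, "out.xlsx", "my_Defined_Name_source_data"), a formula such as =INDEX(my_Defined_Name_source_data, 4,3) * 2 can refer to the data by name without tracking cell addresses.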
