Efficient way of exporting large R dataset to excel - python

As the title says, I have a dataset with about 13,000 rows and 255 columns (actually I have more than 255 columns, but the RODBC package seems to limit the number of exported columns to 255, so I trimmed it a bit) that needs to be exported to an xls/xlsx file.
I tried the RODBC and xlsx packages; both take more than 5 minutes for the export. Is there a more efficient way of doing this?
I know a little Python (I have used it to connect to Outlook and list the emails in a mailbox), so a Python-based export would also be welcome.
update 01
Quite a few people suggested using CSV. That may not be feasible in my case, because one field contains free text and I cannot control which characters are entered in it, which makes choosing a separator difficult.
update 02
Thank you for the suggestions, but I found that the R packages are fine only if the data frame is relatively small, and they are slow even when all columns are character columns. Any suggestions?

There are lots of options:
Use xlsx with multiple sheets (you've tried this and it's too slow, I know)
Use write.csv, which should be faster and produces a file that Excel can read
Use odbcConnectExcel2007 within RODBC
Use the package bigmemory to help you manage the large dataframe, especially if you can make it into a sparse matrix
XLConnect which worked for this guy with the same problem
Write it to a SQL database with RODBC or RPostgreSQL, etc., and then make a connection to the DB within Excel. I do this a lot. Here's a related resource.
Use Pandas
Create a tab-delimited text file and then import it to Excel: write.table (table,sep="\t",quote=FALSE,row.names=FALSE,file=file.name)
Use fread (from data.table; note that fread reads files, and fwrite is the function that writes them)
Try a cloud-based solution (I'm not sure if this will actually be faster, but it would at least be a trendy solution with extra benefits such as providing a nice way to store your data safely and let you query whatever you need from it using Excel on any computer)
RExcel
XLLoop
Finally, here's a nice little article on "A Million Ways to Connect R and Excel" which you may find useful, though I think I've actually given you more options than the article does.
I would start with the simplest solutions, like fread, then work your way to the relatively more complicated solutions if you're still not getting the results you want.
Depending on the exact nature of your project, you might even benefit from parallelism or multicore processing. Those don't boost your I/O speed in most cases, but they could speed up any processing/transformation of your data that takes place in your process, thus making your overall data pipeline faster.
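On the free-text concern raised in update 01: a CSV writer that quotes its fields makes the choice of separator much less critical, because separators, double quotes and line breaks inside a field are escaped rather than interpreted (R's write.csv quotes character columns by default). The following is only a minimal sketch using Python's standard csv module; the sample rows are made up for illustration.

```python
import csv
import io

# Made-up rows: the second field is free text containing commas,
# double quotes, a tab and a line break.
rows = [
    ["id", "comment"],
    [1, 'He said "fine", then left'],
    [2, "first line\nsecond line\twith a tab"],
]

# Write with every field quoted; embedded double quotes are doubled automatically.
buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(rows)

# Read the text back: the free-text fields come out unchanged.
buffer.seek(0)
round_trip = list(csv.reader(buffer))
print(round_trip[1][1])
```

Note that everything read back from a CSV file is a string, so numeric columns have to be converted again by whatever consumes the file.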
Python is also very well-equipped to handle this problem, but there are so many solutions within R, hopefully you won't need to resort to switching languages just to write out data. Still, you could try
XlsxWriter in Constant Memory mode, or
Optimized Reader and Writer of the openpyxl package
if you want to try a Python-based solution.
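One way the Python route might look is a two-step export: R writes a quoted CSV, and Python streams it into an .xlsx file with openpyxl's write-only mode. This is a sketch under assumptions, not a benchmarked solution: openpyxl must be installed, the helper name csv_to_xlsx and the file names are hypothetical, and the small CSV generated here only stands in for the real R export.

```python
import csv
import os
import tempfile

from openpyxl import Workbook, load_workbook


def csv_to_xlsx(csv_path, xlsx_path):
    """Stream a CSV file into a single-sheet .xlsx workbook; return the row count."""
    wb = Workbook(write_only=True)   # write-only mode keeps memory use low
    ws = wb.create_sheet("data")
    n_rows = 0
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh):
            ws.append(row)           # rows can only be appended, one at a time
            n_rows += 1
    wb.save(xlsx_path)
    return n_rows


# Demonstration with a small made-up CSV standing in for the R export.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "export.csv")
xlsx_path = os.path.join(tmp_dir, "export.xlsx")
with open(csv_path, "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow([f"col{i}" for i in range(300)])            # more than 255 columns
    writer.writerow([f"text, with comma {i}" for i in range(300)])

rows_written = csv_to_xlsx(csv_path, xlsx_path)
check = load_workbook(xlsx_path)["data"]
print(rows_written, check.max_column)
```

In this sketch every value is written as text, because the csv module returns strings; numeric columns would need converting before ws.append if Excel should treat them as numbers.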

Try the openxlsx package; it's quite fast.
https://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf
Install the openxlsx package
Load the openxlsx library
Use write.xlsx() or writeData() to write to an xlsx file
A small example of basic operations with openxlsx, taken from the openxlsx documentation:
## setup a workbook with 3 worksheets
wb <- createWorkbook()
addWorksheet(wb = wb, sheetName = "Sheet 1", gridLines = FALSE)
writeDataTable(wb = wb, sheet = 1, x = iris)
addWorksheet(wb = wb, sheetName = "mtcars (Sheet 2)", gridLines = FALSE)
writeData(wb = wb, sheet = 2, x = mtcars)
addWorksheet(wb = wb, sheetName = "Sheet 3", gridLines = FALSE)
writeData(wb = wb, sheet = 3, x = Formaldehyde)
worksheetOrder(wb)
names(wb)
worksheetOrder(wb) <- c(1,3,2) # switch position of sheets 2 & 3
writeData(wb, 2, 'This is still the "mtcars" worksheet', startCol = 15)
worksheetOrder(wb)
names(wb) ## ordering within workbook is not changed
saveWorkbook(wb, "worksheetOrderExample.xlsx", overwrite = TRUE)
worksheetOrder(wb) <- c(3,2,1)
saveWorkbook(wb, "worksheetOrderExample2.xlsx", overwrite = TRUE)
Gani

Related

Appending Excel cell values using pandas

Edit: I found a solution to my question. In short: follow the openpyxl user manual rather than online tutorials. The tutorials I tried (more than one) raised errors, and their approach differed significantly from the one in the user manual. I also ended up not using pandas as much as I thought I would need to.
I am trying to append certain values in an Excel file with multiple sheets based on user inputs and then rewrite it to the Excel file (without deleting the rest of the sheets). So far I have tried this which seems to combine the data but I didn't quite see how it applied to what I am doing since I want to append a part of a sheet instead of rewrite the whole excel file. I have also tried a few other things with ExcelWriter but I don't quite understand it since it usually wipes all the data in the file (I may be using it wrong).
episode_dataframe = pd.read_excel(r'All_excerpts (Siena Copy)_test.xlsx', sheet_name=episode)
# episode is a specified string inputted by user, this line makes a data frame for the specified sheet
episode_dataframe.loc[(int(pass_num) - 1), 'Resources'] = resources
# resources is also a user inputted string, it's what I am trying to append the spreadsheet cell value to, this appends to corresponding data frame
path_R = open("All_excerpts (Siena Copy)_test.xlsx", "rb")
with pd.ExcelWriter(path_R) as writer:
    writer.book = openpyxl.load_workbook(path_R)
    # I copied this from [here][3], I think it should make the writer for the to_excel? I don't fully know
    episode_dataframe.to_excel(writer, sheet_name=episode, engine=openpyxl, if_sheet_exsits ='replace')
    # this should write the sheet data frame onto the file, but I don't want it to delete the other sheets
Additionally, I have been running into a bunch of other smaller errors. A big one was 'Workbook' object has no attribute 'add_worksheet', even though I'm not trying to add a worksheet. I also could not get their solution to work.
I am a bit of a novice at python, so my code might be a bit of a mess.
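The openpyxl-only route mentioned in the edit might look roughly like the sketch below: load the whole workbook, change one cell in one sheet, and save, which leaves the other sheets in place. This is not the asker's actual solution; the helper append_to_cell, the sheet names and the demo file are hypothetical, and the 'Resources' column is located through the header row.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Build a small two-sheet workbook standing in for the real file.
path = os.path.join(tempfile.mkdtemp(), "excerpts_demo.xlsx")
wb = Workbook()
ep1 = wb.active
ep1.title = "Episode 1"
ep1.append(["Pass", "Resources"])
ep1.append([1, "notes"])
ep2 = wb.create_sheet("Episode 2")
ep2.append(["Pass", "Resources"])
ep2.append([1, "untouched"])
wb.save(path)


def append_to_cell(path, sheet_name, row, column_header, text):
    """Append text to one cell, leaving every other sheet as it was."""
    wb = load_workbook(path)
    ws = wb[sheet_name]
    # Find the column whose header (first row) matches column_header.
    headers = [cell.value for cell in ws[1]]
    col = headers.index(column_header) + 1
    cell = ws.cell(row=row, column=col)
    cell.value = f"{cell.value or ''}{text}"
    wb.save(path)


# Pass 1 sits in worksheet row 2, because row 1 holds the headers.
append_to_cell(path, "Episode 1", row=2, column_header="Resources", text="; handout")
result = load_workbook(path)
print(result["Episode 1"]["B2"].value)
```

Because the workbook object holds all sheets, saving it writes all of them back, so nothing needs to be done to protect the sheets that were not edited.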

How can I export variables from .mat file (generated by Dymola) to .csv using python

I'm a student who is quite new to coding in Python.
I have been using Dymola for several years, and I am now using the Dymola/Python interface, which lets you operate Dymola from inside Python (useful for building stock simulations, global sensitivity analysis, etc.).
Now, Dymola always generates .mat files in an efficient but unreadable data structure. How can I export only the variables I'm interested in from that .mat file to .csv using a Python script? (I don't want the whole file converted to .csv because it is simply far too large.)
I'm aware of the DyMat package for Python, which should do the job, but either I don't understand the code or the code is not doing what it should. Does anybody have experience with this?
I am probably missing some code that defines which .mat file to read, which variables I want, and in which directory the result.csv file should be stored.
import csv, numpy

def export(dm, varList, fileName=None, formatOptions={}):
    """Export DyMat data to a CSV file"""
    if not fileName:
        fileName = dm.fileName + '.csv'
    oFile = open(fileName, 'w')
    csvWriter = csv.writer(oFile)
    vDict = dm.sortByBlocks(varList)
    for vList in vDict.values():
        vData = dm.getVarArray(vList)
        vList.insert(0, dm._absc[0])
        csvWriter.writerow(vList)
        csvWriter.writerows(numpy.transpose(vData))
    oFile.close()
Thanks!
In the Dymola distribution there is a utility called alist.exe, that allows you to export a number of variables in CSV format.
Another possibility is to convert the MAT file to SDF format, which is a very simple HDF5 interpretation. The HDF5 file is not as compact as the MAT-file, but you can compress the file using ZIP/GZIP/7ZIP to reduce archival storage. There are both MATLAB and Python scripts for reading the SDF format in the Dymola distribution.
Since this was tagged openmodelica, I am proposing a solution using it:
filterSimulationResults("file.mat", "file.csv", {"x","y","z"}) creates a csv-file with only variables x, y, z (If you think it's still too large, it is possible to resample the file).
For small files (<2GB) Buildingspy (or other Python packages) covers pretty much all needs: https://simulationresearch.lbl.gov/modelica/buildingspy/
However, since one runs into issues when files exceed 2GB (e.g. for full-year simulations), "alist.exe" from Dymola may be used instead. (filterSimulationResults from OpenModelica also fails in that case.)
"alist.exe" seems to accept up to approx. 100 variables per call, and running it once per variable slows things down drastically (exporting 1 or 100 variables takes almost the same time). alist.exe can be driven from Python as follows to automate the export and speed things up.
import os
import pandas as pd

inFile = 'results.mat'  # placeholder: path to the Dymola result file
var_list = ['Component.Name1', 'Component.Name3', 'Component.Name2', '...']  # list of variables to be extracted
N_batch = 100  # number of variables extracted from the .mat file per call (max. approx. 110)

# Build one argument string per batch of N_batch variables
cmds = []  # list of argument strings, one per alist.exe call
cmd = ''   # must exist before the first "cmd +=" below
for i, var in enumerate(var_list):
    if i % N_batch == 0 and i > 0:
        cmds.append(cmd)
        cmd = ''
    cmd += f' -e {var}'  # build command
cmds.append(cmd)

# Run alist.exe once per batch and collect the results
lst_df = []  # list of pandas objects, one per batch
for cmd in cmds:
    os.system(f'"C:/Program Files/Dymola 2021/bin64/alist.exe" {cmd} {inFile} tmp.csv')
    lst_df.append(pd.read_csv('tmp.csv', index_col=[0]).squeeze("columns"))
df_overall = pd.concat(lst_df, axis=1)
df_overall.to_csv('CompleteCSVFile.csv')  # or use .pkl for more efficient writing and reading
It is still not a fast solution, but it makes processing the data possible in the first place. Dymola's variable selection should always be exploited first, before trying to wrangle such amounts of data on a local machine.
Hope this helps someone someday!
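The batching step of the answer above can also be isolated into a small function, which makes it easy to check without running alist.exe. This is only a sketch: the function name batch_arguments is hypothetical, the variable names are made up, and the " -e name" argument form is taken from the answer rather than from the alist.exe documentation.

```python
def batch_arguments(var_list, batch_size=100):
    """Return one ' -e name -e name ...' argument string per batch of variables."""
    batches = []
    for start in range(0, len(var_list), batch_size):
        chunk = var_list[start:start + batch_size]
        batches.append("".join(f" -e {name}" for name in chunk))
    return batches


# 250 made-up variable names end up in batches of 100, 100 and 50.
names = [f"Component.Name{i}" for i in range(250)]
args = batch_arguments(names, batch_size=100)
print(len(args))
```

Each returned string can then be placed into the alist.exe command line in the same way as cmd in the loop above.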

Calculating Excel sheets without opening them (openpyxl or xlwt)

I made a script that opens a .xls file, writes a few new values in it, then saves the file.
Later, the script opens it again, and wants to find the answers in some cells which contain formulas.
If I call that cell with openpyxl, I get the formula (ie: "=A1*B1").
And if I activate data_only, I get nothing.
Is there a way to let Python calculate the .xls file? (or should I try PyXll?)
I realize this question is old, but I ran into the same problem and extensive searching didn't produce an answer.
The solution is in fact quite simple so I will post it here for posterity.
Let's assume you have an xlsx file that you have modified with openpyxl. As Charlie Clark mentioned openpyxl will not calculate the formulas, but if you were to open the file in excel the formulas would be automatically calculated. So all you need to do is open the file and then save it using excel.
To do this you can use the win32com module.
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
workbook = excel.Workbooks.Open(r'absolute/path/to/your/file')
# this must be the absolute path (r'C:/abc/def/ghi')
workbook.Save()
workbook.Close()
excel.Quit()
That's it. I've seen all these suggestions to use Pycel or Koala, but that seems like a bit of overkill if all you need to do is tell excel to open and save.
Granted this solution is only for windows.
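Why opening and re-saving in Excel helps can be seen in a small reproduction: a workbook written by openpyxl contains the formula text but no cached result, so reading it back with data_only=True yields nothing until a spreadsheet application has calculated and saved the file. A minimal sketch, assuming openpyxl is installed; the file name is a placeholder.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "formula_demo.xlsx")

wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["B1"] = 3
ws["C1"] = "=A1*B1"      # stored as formula text; openpyxl does not evaluate it
wb.save(path)

# Default load: the cell value is the formula text.
as_formula = load_workbook(path).active["C1"].value
# data_only load: the cell value is the cached result, which does not exist yet.
as_value = load_workbook(path, data_only=True).active["C1"].value

print(as_formula)
print(as_value)
```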
There is actually a project that takes Excel formulas and evaluates them using Python: Pycel. Pycel uses Excel itself (via COM) to extract the formulas, so in your case you would skip that part. The project probably has something useful that you can use, but I can't vouch for its maturity or completeness. It was not really developed for the general public.
There is also a newer project called Koala which builds on both Pycel and OpenPyXL.
Another approach, if you can't use Excel but you can calculate the results of the formulas yourself (in your Python code), is to write both the value and the formula into a cell (so that when you read the file, you can just pull the value, and not worry about the formula at all). As of this writing, I haven't found a way to do it in OpenPyXL, but XlsxWriter can do it. From the documentation:
XlsxWriter doesn’t calculate the value of a formula and instead stores the value 0 as the formula result. It then sets a global flag in the XLSX file to say that all formulas and functions should be recalculated when the file is opened. This is the method recommended in the Excel documentation and in general it works fine with spreadsheet applications. However, applications that don’t have a facility to calculate formulas, such as Excel Viewer, or some mobile applications will only display the 0 results.
If required, it is also possible to specify the calculated result of the formula using the options value parameter. This is occasionally necessary when working with non-Excel applications that don’t calculate the value of the formula. The calculated value is added at the end of the argument list:
worksheet.write_formula('A1', '=2+2', num_format, 4)
With this approach, when it's time to read the value, you would use OpenPyXL's data_only option. (For other people reading this answer: If you use xlrd, then only the value is available anyway.)
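Put together, the value-plus-formula approach could look like the following sketch: the result is computed in Python, stored next to the formula with XlsxWriter's value argument, and later read back through openpyxl's data_only option. It assumes both xlsxwriter and openpyxl are installed; the file name and cell contents are made up.

```python
import os
import tempfile

import xlsxwriter
from openpyxl import load_workbook

path = os.path.join(tempfile.mkdtemp(), "cached_value_demo.xlsx")

workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet()
worksheet.write("A1", 2)
worksheet.write("B1", 3)
# The result (6) is computed in Python and stored alongside the formula.
worksheet.write_formula("C1", "=A1*B1", None, 2 * 3)
workbook.close()

# The formula text is still in the file ...
formula_text = load_workbook(path).active["C1"].value
# ... and data_only=True returns the value that was stored with it.
cached_value = load_workbook(path, data_only=True).active["C1"].value
print(formula_text, cached_value)
```

The stored value is only as correct as the Python calculation behind it; Excel will replace it with its own result the next time it recalculates and saves the file.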
Finally, if you do have Excel, then perhaps the most straightforward and reliable thing you can do is automate the opening and resaving of your file in Excel (so that it will calculate and write the values of the formulas for you). xlwings is an easy way to do this from either Windows or Mac.
The formula module works for me. For detail please refer to https://pypi.org/project/formulas/
from os import path
from openpyxl import load_workbook
import formulas
# The variable spreadsheet holds the full path, including the file name, of the Excel spreadsheet with unevaluated formulae
fpath = path.basename(spreadsheet)
dirname = path.dirname(spreadsheet)
xl_model = formulas.ExcelModel().loads(fpath).finish()
xl_model.calculate()
xl_model.write(dirpath=dirname)
# Use openpyxl to open the updated Excel spreadsheet now
wb = load_workbook(filename=spreadsheet, data_only=True)
ws = wb.active
I ran into the same problem, and after some time researching I ended up using pyoo ( https://pypi.org/project/pyoo/ ), which is for OpenOffice/LibreOffice and therefore available on all platforms. It is also more straightforward, since it communicates natively and doesn't require saving/closing the file. I tried several other libraries but found the following problems:
xlwings: Only works if you have Excel installed on Windows/macOS, so I couldn't evaluate it.
koala: Seems to be broken after the networkx 2.4 update.
openpyxl: As pointed out by others, it can't calculate formulas, so I was looking into combining it with pycel to get values. In the end I didn't try it because I found pyoo. openpyxl + pycel might not work as of now, since pycel also relies on the networkx library.
No, and in openpyxl there never will be. I think there is a Python library that purports to implement an engine for such formulae, which you can use.
xlcalculator can do this job. https://github.com/bradbase/xlcalculator
from xlcalculator import ModelCompiler
from xlcalculator import Model
from xlcalculator import Evaluator
filename = r'use_case_01.xlsm'
compiler = ModelCompiler()
new_model = compiler.read_and_parse_archive(filename)
evaluator = Evaluator(new_model)
# First!A2
# value is 0.1
#
# Fourth!A2
# formula is =SUM(First!A2+1)
val1 = evaluator.evaluate('Fourth!A2')
print("value 'evaluated' for Fourth!A2:", val1)
evaluator.set_cell_value('First!A2', 88)
# now First!A2 value is 88
val2 = evaluator.evaluate('Fourth!A2')
print("New value for Fourth!A2 is", val2)
Which results in the following output;
file_name use_case_01.xlsm ignore_sheets []
value 'evaluated' for Fourth!A2: 1.1
New value for Fourth!A2 is 89

Pywin32 Excel Formatting

I want to write to an Excel sheet via pywin32, which I can do without problems. But I couldn't format a range of cells in the sheet: I want to center the values inside the cells, and I also need to fill the cells with color. How can I do it?
Thanks in advance.
I've not specifically done this using python before, but I'm assuming you're using the COM automation interface to excel.
This page has an example that seems to cover both alignment and filling cells with colour in C#, so it should be fairly easy to adapt to python. Assuming you have a Worksheet object called sheet, and the Excel automation object is called Excel, I'm guessing it might look a bit like this:
# Format A1:D1 as center-aligned
sheet.Range("A1", "D1").VerticalAlignment = Excel.XlVAlign.xlVAlignCenter
sheet.Range("A1", "D1").HorizontalAlignment = Excel.XlHAlign.xlHAlignCenter
sheet.Range("A1", "D1").Interior.ColorIndex = Excel.XlColorIndex.Red
If you don't have access to the Excel.XlAlign and XlColorIndex constants from Python, then you can just replace them with the specific integers they represent, though I'm not entirely sure where you could get them from. Probably from a VBA reference site or similar. (Though the link I provided doesn't seem to allow you to expand each of the entries in the list, so you may need to look elsewhere.)
EDIT: Just had a play about with excel automation via the python console, and it seems to work alright:
>>> from win32com.client import Dispatch
>>> xlApp = Dispatch("Excel.Application")
>>> xlWb = xlApp.Workbooks.Add()
>>> xlSht = xlWb.WorkSheets(1)
>>> xlSht.Range("A1", "D1").VerticalAlignment = 1
>>> xlSht.Range("A1", "D1").Interior.ColorIndex = 6
>>> # The background color of A1-D1 should now be yellow
>>> xlSht.Cells(1, 1).VerticalAlignment = 1
If you can't find any good reference on what the various alignment/colour constants are, then I'd just play about with python on the console like this, then open the resulting worksheet in excel and have a look at the results to figure things out.
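For reference, Excel's enumerations define both xlHAlignCenter and xlVAlignCenter as -4108, and Interior.Color expects the red component in the lowest byte. The sketch below wraps this in two helpers; the names excel_rgb and center_and_fill are hypothetical, and a plain stand-in object replaces the COM Range so that the code runs without Excel.

```python
from types import SimpleNamespace

# Values from Excel's XlHAlign and XlVAlign enumerations.
XL_HALIGN_CENTER = -4108   # xlHAlignCenter
XL_VALIGN_CENTER = -4108   # xlVAlignCenter
COLOR_INDEX_YELLOW = 6     # default palette entry, as in the session above


def excel_rgb(red, green, blue):
    """Pack a colour the way Interior.Color expects it (red in the low byte)."""
    return red + (green << 8) + (blue << 16)


def center_and_fill(cell_range, color_index=None, rgb=None):
    """Center a range both ways and optionally fill it with a colour."""
    cell_range.HorizontalAlignment = XL_HALIGN_CENTER
    cell_range.VerticalAlignment = XL_VALIGN_CENTER
    if color_index is not None:
        cell_range.Interior.ColorIndex = color_index
    if rgb is not None:
        cell_range.Interior.Color = excel_rgb(*rgb)


# Stand-in for a COM Range so the sketch runs without Excel installed;
# with pywin32 you would pass xlSht.Range("A1", "D1") instead.
demo_range = SimpleNamespace(Interior=SimpleNamespace())
center_and_fill(demo_range, color_index=COLOR_INDEX_YELLOW)
print(demo_range.HorizontalAlignment, demo_range.Interior.ColorIndex)
```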
You can find the official reference for the Office 2003 automation API here
Specifically, you'll probably find the range documentation most useful.

Is there any way to edit an existing Excel file using Python preserving formulae?

I am trying to edit several excel files (.xls) without changing the rest of the sheet. The only thing close so far that I've found is the xlrd, xlwt, and xlutils modules. The problem with these is it seems that xlrd evaluates formulae when reading, then puts the answer as the value of the cell. Does anybody know of a way to preserve the formulae so I can then use xlwt to write to the file without losing them? I have most of my experience in Python and CLISP, but could pick up another language pretty quick if they have better support. Thanks for any help you can give!
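I had the same problem... and eventually found the following module:
from openpyxl import load_workbook

def Write_Workbook():
    # path and Some_value are assumed to be defined elsewhere
    wb = load_workbook(path)
    ws = wb["Sheet_name"]  # replaces the deprecated wb.get_sheet_by_name("Sheet_name")
    c = ws.cell(row=2, column=1)
    c.value = Some_value
    wb.save(path)

Doing this, my file was saved with all previously inserted formulas preserved. (Note that openpyxl handles .xlsx/.xlsm files, not the legacy .xls format.)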
Hope this helps!
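The preservation claim can be checked with a short round trip: create a workbook containing a formula, reload it, change another cell, save, and confirm the formula text is still there. A minimal sketch, assuming openpyxl is installed; the file name and cell contents are made up.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "preserve_demo.xlsx")

# Create a workbook with a formula, standing in for an existing file.
wb = Workbook()
ws = wb.active
ws["A1"] = 10
ws["A2"] = 20
ws["A3"] = "=SUM(A1:A2)"
wb.save(path)

# Edit one cell and save again; data_only is left at its default (False).
wb = load_workbook(path)
wb.active["A2"] = 99
wb.save(path)

reloaded = load_workbook(path).active
formula_after_edit = reloaded["A3"].value
edited_value = reloaded["A2"].value
print(formula_after_edit, edited_value)
```

The default load mode matters here: a workbook loaded with data_only=True and then saved keeps only the cached values, so the formulas would be lost.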
I've used the xlwt.Formula function before to be able to get hyperlinks into a cell. I imagine it will also work with other formulas.
Update: Here's a snippet I found in a project I used it in:
link = xlwt.Formula('HYPERLINK("%s";"View Details")' % url)
sheet.write(row, col, link)
As of now, xlrd doesn't read formulas. It's not that it evaluates them, it simply doesn't read them.
For now, your best bet is to programmatically control a running instance of Excel, either via pywin32 or Visual Basic or VBScript (or some other Microsoft-friendly language which has a COM interface). If you can't run Excel, then you may be able to do something analogous with OpenOffice.org instead.
We've just had this problem and the best we can do is to manually re-write the formulas as text, then convert them to proper formulas on output.
So open Excel and replace =SUM(C5:L5) with "=SUM(C5:L5)", including the quotes. If your formula contains a double quote, replace it with two double quotes, which escapes it, so = "a" & "b" becomes "= ""a"" & ""b"" "
Then in your Python code, loop over every cell in the source and output sheets and do:
output_sheet.write(row, col, xlwt.ExcelFormula.Formula(source_cell[1:-1]))
We use this SO answer to make a copy of the source sheet to be the output sheet, which even preserves styles, and avoids overwriting the hand written text formulas from above.
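The quoting convention described above can be written as a pair of helper functions. This is a sketch, not the answerer's code: the names wrap_formula and unwrap_formula are hypothetical, and un-doubling the inner quotes when unwrapping is an added assumption (the snippet above only strips the outer quotes with [1:-1]).

```python
def wrap_formula(formula):
    """Turn a formula into quoted text: double inner quotes, add outer quotes."""
    return '"' + formula.replace('"', '""') + '"'


def unwrap_formula(text):
    """Reverse wrap_formula; text that is not wrapped in quotes is returned as-is."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1].replace('""', '"')
    return text


wrapped = wrap_formula('= "a" & "b"')
restored = unwrap_formula(wrapped)
print(wrapped)
print(restored)
```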
