Python: Creating Excel worksheets with charts

Is there any module for creating Excel worksheets with embedded charts in Python? The modules mentioned in this question don't seem to have that capability.
I prefer a generic module that would work under Ubuntu, not a Windows-dependent one.
EDIT: I would also appreciate ways to embed images within the created sheets, as I can create the charts in an external program and place them within the right sheet.
Thanks,
Adam

I recently found xlsxwriter. It's the most capable xlsx Python module I've found, and it works with charts and graphs. It doesn't require any non-standard Python modules and works on any type of box; there's no need for Windows or for charting software to be installed.
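For example, a minimal sketch (file names and cell ranges here are made up for illustration):
import xlsxwriter

workbook = xlsxwriter.Workbook('chart_demo.xlsx')
worksheet = workbook.add_worksheet()

# Some sample data to chart.
worksheet.write_column('A1', [10, 40, 50, 20, 10, 50])

# Create a column chart and point it at the data.
chart = workbook.add_chart({'type': 'column'})
chart.add_series({'values': '=Sheet1!$A$1:$A$6'})
worksheet.insert_chart('C1', chart)

# Per the question's EDIT: externally created images can be placed on a sheet.
worksheet.insert_image('C20', 'my_plot.png')

workbook.close()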

On Windows, you'd need to use pywin32 and COM. On a *nix box, you may find that a combination of IronPython, Mono, and an Excel-manipulation library written for .NET may do the job. In either case, good luck.
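If you do end up on the COM route, a rough sketch with pywin32 (this assumes Excel is installed; the chart-type constant comes from the Excel object model):
import win32com.client as wc

# Start Excel via COM (Windows only).
xl = wc.gencache.EnsureDispatch('Excel.Application')
wb = xl.Workbooks.Add()
ws = wb.Worksheets(1)

# Write a small block of data in a single call.
ws.Range('A1:B4').Value = [('x', 'y'), (1, 2), (2, 4), (3, 8)]

# Add a chart sheet and point it at the data.
chart = wb.Charts.Add()
chart.ChartType = wc.constants.xlColumnClustered
chart.SetSourceData(ws.Range('A1:B4'))

wb.SaveAs('ChartDemo.xlsx')
xl.Quit()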

It's a little bit convoluted (and/or evil), but something like this will work cross-platform (including under Linux) using JPype to wrap the SmartXLS Excel Java library.
This example uses the simple chart creation (in Charts/ChartSample.class) example from SmartXLS.
#!/usr/bin/env python
import os
import os.path
import jpype
# or wherever your java is installed
os.environ['JAVA_HOME'] = "/usr/lib64/jvm/default-java"
root = os.path.abspath(os.path.dirname(__file__))
SX_JAR = os.path.join(root, 'SX.jar')
options = [
'-Djava.class.path=%s' % SX_JAR
]
jpype.startJVM(jpype.getDefaultJVMPath(), *options)
WorkBook = jpype.JClass('com.smartxls.WorkBook')
ChartShape = jpype.JClass('com.smartxls.ChartShape')
ChartFormat = jpype.JClass('com.smartxls.ChartFormat')
Color = jpype.JClass('java.awt.Color')
workbook = WorkBook()
workbook.setText(0,1,"Jan")
workbook.setText(0,2,"Feb")
workbook.setText(0,3,"Mar")
workbook.setText(0,4,"Apr")
workbook.setText(0,5,"Jun")
workbook.setText(1,0,"Comfrey")
workbook.setText(2,0,"Bananas")
workbook.setText(3,0,"Papaya")
workbook.setText(4,0,"Mango")
workbook.setText(5,0,"Lilikoi")
for col in range(1, 5 + 1):
    for row in range(1, 5 + 1):
        workbook.setFormula(row, col, "RAND()")
workbook.setText(6, 0, "Total")
workbook.setFormula(6, 1, "SUM(B2:B6)")
workbook.setSelection("B7:F7")
# auto fill the range with the first cell's formula or data
workbook.editCopyRight()
left = 1.0
top = 7.0
right = 13.0
bottom = 31.0
# create the chart at the given location
chart = workbook.addChart(left,top,right,bottom)
chart.setChartType(ChartShape.Column)
# link the data source; each series links to columns (pass True to link to rows)
chart.setLinkRange("Sheet1!$a$1:$F$6", False)
# set axis title
chart.setAxisTitle(ChartShape.XAxis, 0, "X-axis data")
chart.setAxisTitle(ChartShape.YAxis, 0, "Y-axis data")
# set series name
chart.setSeriesName(0, "My Series number 1")
chart.setSeriesName(1, "My Series number 2")
chart.setSeriesName(2, "My Series number 3")
chart.setSeriesName(3, "My Series number 4")
chart.setSeriesName(4, "My Series number 5")
chart.setTitle("My Chart")
# set plot area's color to darkgray
chartFormat = chart.getPlotFormat()
chartFormat.setSolid()
chartFormat.setForeColor(Color.DARK_GRAY.getRGB())
chart.setPlotFormat(chartFormat)
# set series 0's color to blue
seriesformat = chart.getSeriesFormat(0)
seriesformat.setSolid()
seriesformat.setForeColor(Color.BLUE.getRGB())
chart.setSeriesFormat(0, seriesformat)
# set series 1's color to red
seriesformat = chart.getSeriesFormat(1)
seriesformat.setSolid()
seriesformat.setForeColor(Color.RED.getRGB())
chart.setSeriesFormat(1, seriesformat)
# set chart title's font property
titleformat = chart.getTitleFormat()
titleformat.setFontSize(14*20)
titleformat.setFontUnderline(True)
chart.setTitleFormat(titleformat)
workbook.write("./Chart.xls")
jpype.shutdownJVM()

Related

HDF5 tagging datasets to events in other datasets

I am sampling time-series data off various machines, and every so often I need to collect a large high-frequency burst of data from another device and append it to the time-series data.
Imagine I am measuring temperature over time, and on every 10-degree increase in temperature I sample a micro at 200 kHz. I want to be able to tag the large burst of micro data to a timestamp in the time-series data, maybe even in the form of a figure.
I was trying to do this with regionref, but I am struggling to find an elegant solution, and I'm finding myself juggling between pandas' HDFStore and h5py, which just feels messy.
Initially I thought I would be able to make separate datasets from the burst data, then use references or links to timestamps in the time-series data. But no luck so far.
Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!
How did you use region references? I assume you had an array of references, alternating between ranges of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
Virtual Datasets might be a more elegant solution... but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation; HDF5/h5py handles everything under the covers.
To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:
1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to the other datasets.
2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (one virtual source per file/dataset combination).
3. Map the virtual source data into the virtual layout (you can use slice notation, as shown in my example).
4. Repeat steps 2 and 3 for all sources (or slices of sources).
Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)
There are at least 3 other SO questions and answers on this topic:
h5py, enums, and VirtualLayout
h5py error reading virtual dataset into NumPy array
How to combine multiple hdf5 files into one file and dataset?
Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.
import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3
arr = np.zeros((log_ntimes,2))
for i in range(log_ntimes):
    time = i*log_inc
    arr[i,0] = time
    #temp = 70. + 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')
arr[:,1] = 70. + 100.*arr[:,0]
#print(arr)

with h5py.File('SO_72654160.h5','w') as h5f:
    h5f.create_dataset('data_log',data=arr)

n_bursts = 4
burst_ntimes = 11
burst_inc = 5e-5
for n in range(1,n_bursts):
    arr = np.zeros((burst_ntimes-1,2))
    for i in range(1,burst_ntimes):
        burst_time = 0.01*(n)
        time = burst_time + i*burst_inc
        arr[i-1,0] = time
        #temp = 70. + 100.*time
    arr[:,1] = 70. + 100.*arr[:,0]
    with h5py.File('SO_72654160.h5','a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}',data=arr)
Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates one virtual dataset in a new file, and one in the existing file. (The statements are identical except for the file name and mode.)
source_file = 'SO_72654160.h5'
a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]
print(f'Total data rows in source = {a0}')

# alternate getting data from:
# dataset: data_log, rows 0-11, 11-21, 21-31
# datasets: burst_log_01, burst_log_02, etc. (each has 10 rows)

# Define the virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2), dtype=float)

# Map the virtual dataset to the logged data
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31,2))
layout[0:11,:] = vsource1[0:11,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
layout[11:21,:] = vsource2
layout[21:31,:] = vsource1[11:21,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
layout[31:41,:] = vsource2
layout[41:51,:] = vsource1[21:31,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
layout[51:61,:] = vsource2

# Create a NEW file, then add the virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open the EXISTING file, then add the virtual dataset
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
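Once created, the virtual dataset reads like any other dataset; for example:
# Read the virtual dataset back with ordinary slice notation.
with h5py.File('SO_72654160_VDS.h5', 'r') as h5f:
    print(h5f['vdata'][0:5, :])  # h5py resolves the source mapping for you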

Convert win32com.client Range to Pandas Dataframe?

I am writing some macros that call Python code to perform operations on ranges in Excel. It is much easier to do a lot of the required operations with pandas in Python. Because I want to do this while the spreadsheet is open (and may not have been saved), I am using the win32com.client to read in a range of cells to convert to a Pandas dataframe. However, this is extremely slow, presumably because the way I calculate it is very inefficient:
import datetime
import pytz
import pandas
import time
import win32com.client
def range_to_table(excelRange, tsy, tsx, height, width, add_cell_refs=True):
    ii = 0
    keys = []
    while ii < width:
        keys.append(str(excelRange[ii]))
        ii += 1
    colnumbers = {key: jj + tsx for jj, key in enumerate(keys)}
    keys.append('rownumber')
    mydict = {key: [] for key in keys}
    while ii < width*height:
        mydict[keys[ii % width]].append(excelRange[ii].value)
        ii += 1
    for yy in range(tsy + 1, tsy + 1 + height - 1):  # add 1 to not include header
        mydict['rownumber'].append(yy)
    return (mydict, colnumbers)
ExcelApp = win32com.client.GetActiveObject('Excel.Application')
wb = ExcelApp.Workbooks('myworkbook.xlsm')
sheet_num = [sheet.Name for sheet in wb.Sheets].index("myworksheet name") + 1
ws = wb.Worksheets(sheet_num)
height = int(ws.Cells(1, 3)) # obtain table height from formula in excel spreadsheet
width = int(ws.Cells(1, 2)) # obtain table width from formula in excel spreadsheet
myrange = ws.Range(ws.Cells(2, 1), ws.Cells(2 + height - 1, 1 + width - 1))
df, colnumbers = range_to_table(myrange, 1, 1, height, width)
df = pandas.DataFrame.from_dict(df)
This works, but the range_to_table function I wrote is extremely slow for large tables since it iterates over each cell one by one.
I suspect there is probably a much better way to convert the Excel Range object to a Pandas dataframe. Do you know of a better way?
Here is a simplified example of what my range looks like: a plain table with a header row. The height and width variables in the code are just taken from cells immediately above the table.
Any ideas here, or am I just going to have to save the workbook and use Pandas to read in the table from the saved file?
There are two parts to the operation: defining the spreadsheet range, and then getting the data into Python. The test data I'm working with matches the output shown below.
1. Defining the range: Excel has a feature called Dynamic Ranges. This allows you to give a name to a range whose extent is variable.
I've set up a dynamic range called 'DynRange', defined so that it uses the row and column counts in $C$1 and $C$2 to size the array.
Once you have this definition, the range can be used by name in Python, which saves you having to access the row and column counts explicitly.
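For reference, a dynamic range of this kind is typically defined in Excel's Name Manager with an OFFSET formula; a plausible definition (the anchor cell here is a guess, since the original screenshot is not shown) would be:
=OFFSET(Sheet1!$A$3,0,0,Sheet1!$C$1,Sheet1!$C$2)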
2. Using this range in Python via win32com: Once you have defined the name in Excel, handling it in Python is much easier.
import win32com.client as wc
import pandas as pd
#Create a dispatch interface
xl = wc.gencache.EnsureDispatch('Excel.Application')
filepath = 'SomeFilePath\\TestBook.xlsx'
#Open the workbook
wb = xl.Workbooks.Open(filepath)
#Get the Worksheet by name
ws = wb.Sheets('Sheet1')
#Use the Value property to get all the data in the range
listVals = ws.Range('DynRange').Value
#Construct the dataframe, using first row as headers
df = pd.DataFrame(listVals[1:],columns=listVals[0])
#Optionally process the datetime value to avoid tz warnings
df['Datetime'] = df['Datetime'].dt.tz_convert(None)
print(df)
wb.Close()
Output:
             Datetime  Principal  Source Amt  Cost Basis
0 2021-04-21 04:59:00      -5.00       1.001         5.0
1 2021-04-25 15:16:00    -348.26       1.001        10.0
2 2021-04-29 11:04:00       0.00       1.001         5.0
3 2021-04-29 21:26:00       0.00       1.001         5.0
4 2021-04-29 23:39:00       0.00       1.001         5.0
5 2021-05-02 14:00:00   -2488.40       1.001         5.0
As the OP suspects, iterating over the range cell-by-cell performs slowly. The COM infrastructure has to do a good deal of processing to pass data from one process (Excel) to another (Python). This is known as 'marshalling'. Most of the time is spent packing up the variables on one side and unpacking on the other. It is much more efficient to marshal the entire contents of an Excel Range in one go (as a 2D array) and Excel allows this by exposing the Value property on the Range as a whole, rather than by cell.
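To see the difference without a named range, the same single-round-trip idea applies to the range the OP builds from the height/width cells; a sketch reusing the ws, height and width objects from the question:
# One COM round trip: .Value on the whole Range returns a tuple of row tuples.
rows = ws.Range(ws.Cells(2, 1), ws.Cells(2 + height - 1, width)).Value
df = pandas.DataFrame(rows[1:], columns=rows[0])  # first row holds the headers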
You can try using multiprocessing for this. You could have each worker scan a different column, for example, or even do the same with the rows.
Minor changes to your code are needed:
Create a function that iterates over the columns and stores the information in a dict.
Use the simple multiprocessing pattern from https://pymotw.com/2/multiprocessing/basics.html
Create a function that merges the dicts created by each worker into a single one.
That should divide your compute time by the number of workers used.

How to make subplots from multiple files? Python matplotlib

I'm a student researcher who's running simulations on exoplanets to determine if they might be viable for life. The software I'm using outputs a file with several columns of various types of data. So far, I've written a Python script that goes through one file and grabs two columns of data, in this case time and global temperature of the planet.
What I want to do is:
Write a python script that goes through multiple files, and grabs the same two columns that my current script does.
Then, I want to create subplots of all these files
The one thing that stays consistent across all of the files is the x axis: it will always be time (from 0 to 1 million years). The y-axis values will change across simulations, though.
This is what I got so far for my code:
import math as m
import matplotlib.pylab as plt
import numpy as np
## Set datafile equal to the file I plan on using for data, and then open it
datafile = r"C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]\solarsys.Earth.forward"
file = open(datafile, "r")
# Create two empty arrays for my x and y axis of my graphs
years = [ ]
GlobalT = [ ]
# A for loop that looks in my file, and grabs only the 1st and 8th column adding them to the respective arrays
for line in file:
    data = line.split(' ')
    years.append(float(data[0]))
    GlobalT.append(float(data[7]))
# Close the file
file.close()
# Plot my graph
fig = plt.gcf()
plt.plot(years, GlobalT)
plt.title('Global Temperature of GJ 229 b over time')
fig.set_size_inches(10, 6, forward=True)
plt.figtext(0.5, 0.0002, "This shows the global temperature of GJ 229 b when its semi-major axis is 0.929 au, \n"
            " and its actual mass relative to the sun (~8 Earth Masses)", wrap=True, horizontalalignment='center', fontsize=12)
plt.xlabel(" Years ")
plt.ylabel("Global Temp")
plt.show()
I think the simplest thing to do is to turn your code for one file into a function, and then call it in a loop that iterates over the files.
from pathlib import Path

import pandas as pd

def parse_datafile(pth):
    """Parse one datafile into a list of row dicts."""
    results = []
    with pth.open('r') as f:
        for line in f:
            data = line.split(' ')
            results.append({'f': pth.stem,
                            'y': data[0],
                            't': data[7]})
    return results

basedir = Path(r'C:\Users\sasuk\OneDrive\Desktop\GJ 229 Planet Data\Gj 229 b - [SemiMajor 0.867][Ecc][Mass][Luminosity]')

# assuming you want to parse all files in the directory
# if not, change the glob string to match the files you want
# (flatten, since parse_datafile returns a list of rows per file)
all_results = [row for pth in basedir.glob('*') for row in parse_datafile(pth)]

df = pd.DataFrame(all_results)
df['y'] = pd.to_numeric(df['y'], errors='coerce')
df['t'] = pd.to_numeric(df['t'], errors='coerce')
This will give you a dataframe with three columns - f (the filename), y (the year), and t (the temperature). You then have to convert y and t to numeric dtypes. This will be faster and handle errors more gracefully than your code, which will raise an error with any malformed data.
You can further manipulate this as needed to generate your plots. Definitely check if there are any NaN values and address them accordingly, either by dropping those rows or using fillna.
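For the subplot step, one way (a sketch, assuming the dataframe built above) is to group by filename and give each file its own axes:
import matplotlib.pyplot as plt

# One row of subplots per file, sharing the time axis.
groups = list(df.groupby('f'))
fig, axes = plt.subplots(len(groups), 1, sharex=True,
                         figsize=(10, 3*len(groups)), squeeze=False)
for ax, (fname, sub) in zip(axes.ravel(), groups):
    ax.plot(sub['y'], sub['t'])
    ax.set_title(fname)
    ax.set_ylabel('Global Temp')
axes.ravel()[-1].set_xlabel('Years')
plt.tight_layout()
plt.show()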

Separate TAP and HOVER tool for Edges of hv.Graph. Edge description data is missing

I'm trying to get an hv graph with the ability to tap edges separately from nodes. In my case, all the meaningful data is bound to the edges.
import holoviews as hv
import geoviews as gv

gNodes = hv.Nodes((nodes_data.x, nodes_data.y, nodes_data.nid, nodes_data.name),
                  vdims=['name'])
gGraph = hv.Graph(((edges_data.source, edges_data.target, edges_data.name), gNodes),
                  vdims=['name'])
opts = dict(width=1200, height=800, xaxis=None, yaxis=None, bgcolor='black', show_grid=True)
gEdges = gGraph.edgepaths
tiles = gv.tile_sources.Wikipedia()
(tiles * gGraph.edgepaths * gGraph.nodes.opts(size=12)).opts(**opts)
If I use gGraph.edgepaths * gGraph.nodes, no edge information is displayed by the Hover tool.
The 'edges' inspection policy for hv.Graph is not suitable for my task, because it offers no single-edge selection.
Where did the edge label information in the edgepaths property go? How can I add it?
Thank you!
I created a separate dataframe for each link, grouped it by unique link label, and inserted an empty row between each group (two rows per edge: source and target), as in this case: Pandas: Inserting an empty row after every 2nd row in a data frame
empty_row = pd.Series(np.nan, edges_data.columns)
insert_f = lambda d: d.append(empty_row, ignore_index=True)
edges_df = edges_data.groupby(by='name', group_keys=False).apply(insert_f).reset_index(drop=True)
and created an hv.EdgePaths from the df:
gPaths2 = hv.EdgePaths(edges_df, kdims=['lon_conv_a','lat_conv_a'])
TAP and HOVER works fine for me.
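For completeness, enabling the inspection tools explicitly on the resulting element would look something like this (a sketch; it assumes the gPaths2 defined above plus the tiles, gGraph and opts objects from the question):
# Turn on the Bokeh tap/hover tools for the edge paths.
gPaths2 = gPaths2.opts(tools=['tap', 'hover'])
(tiles * gPaths2 * gGraph.nodes.opts(size=12)).opts(**opts)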

Why does matplotlib draw the new graph superimposed on the old one?

I'm working on a Django project and using the matplotlib library. I have created a filter where you can choose the day and the "node" that you want to graph, and with this information a Python script is executed that, together with pandas and matplotlib, creates a graph.
The values of "node" and "day" arrive correctly at the script, which generates the graphic well. The only problem is that instead of overwriting the old image (with the previous graphic), it draws the new lines on top of it.
Each line corresponds to a different day, because the different tests I have done have been overlapping. Can anyone tell me where I'm going wrong?
Below I attach code
def bateria2(node, day):
    csv_path = os.path.join(os.path.dirname(__file__), '..\\data\\csv\\dataframe.csv')
    df = pd.read_csv(csv_path)
    mes, anyo = 12, 2019
    new_df = df[(df['Dia'] == day) & (df['Mes'] == mes) & (df['Año'] == anyo) & (df['Node name'] == node)]
    if len(new_df) > 0:
        #os.remove('static\\img\\bateria2.png')
        x = new_df['Hora[UTC]'].tolist()
        y = new_df['Bateria'].tolist()
        title = 'Carga/Descarga de la batería día '+str(day)+'/'+str(mes)+'/'+str(anyo)+' de '+str(node)
        plt.title(title)
        plt.xlabel('Hora [UTC]')
        plt.ylabel('Batería')
        #plt.legend((y)(node))
        plt.plot(x, y)
        plt.xticks(x, rotation='vertical')
        plt.savefig('static\\img\\bateria2.png', transparent=True)
        return 1
    else:
        return 0
Basically what I'm doing is accessing the .csv file that contains the info and filtering it according to the data that I want. If the new dataframe generated has data, I create the graph and finally save it.
Regards, thank you very much.
Try clearing the current figure with plt.clf() after your savefig command. This should keep your plots from stacking up on top of each other.
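A minimal sketch of where the call goes, plus an alternative that avoids the problem at the source:
plt.plot(x, y)
plt.savefig('static\\img\\bateria2.png', transparent=True)
plt.clf()  # clear pyplot's current figure so the next request starts clean

# Alternatively (a sketch), the object-oriented API avoids pyplot's global
# state entirely, which is safer inside a long-running Django process:
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

fig = Figure()
FigureCanvasAgg(fig)  # attach an Agg canvas (needed on older matplotlib)
ax = fig.add_subplot(111)
ax.plot(x, y)
fig.savefig('static\\img\\bateria2.png', transparent=True)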
