Identify external workbook links using openpyxl


I am trying to identify all cells that contain external workbook references, using openpyxl in Python 3.4. But I am failing. My first try consisted of:
def find_external_value(cell):
    # identifies an external link in a given cell
    has_external_value = False
    if isinstance(cell.value, str) and '.xls' in cell.value:
        has_external_value = True
    return has_external_value
However, when I print the cell values that have external values to the console, it yields this:
=[1]Sheet1!$B$4
=[2]Sheet1!$B$4
So, openpyxl obviously does not parse formulas containing external values in the way I imagined and since square brackets are used for table formulas, there is no sense in trying to pick up on external links in this manner.
I dug a little deeper and found the detect_external_links function in the openpyxl.workbook.names.external module (reference). I have no idea if one can actually call this function to do what I want.
From the console results it seems as if openpyxl understands that there are references, and seems to contain them in a list of sorts. But can one access this list? Or detect if such a list exists?
Whichever way - all I need is to figure out if a cell contains a link to an external workbook.

I have found a solution to this.
Use the openpyxl library to load the xlsx file:
import openpyxl
wb = openpyxl.load_workbook("Myworkbook.xlsx")
# len(wb._external_links) gives the count of linked workbooks
items = wb._external_links
for item in items:
    # Target holds the path of the linked workbook, often as a file:/// URL
    Mystr = item.file_link.Target
    Mystr = Mystr.replace("file:///", "")
    print(Mystr.replace("%20", " "))
----------------------------
Out[01]: ##Indicates that the workbook has 4 external workbook links##
/Users/myohannan/AppData/Local/Temp/49/orion/Extension Workpapers_Learning Extension Calc W_83180610.xlsx
/Users/lmmeyer/AppData/Local/Temp/orion/Complete Set of Workpapers_PPS Workpapers 123112_111698213.xlsx
\\SF-DATA-2\IBData\TEMP\ie5\Temporary Internet Files\OLK8A\LBO Models\PIGLET Current.xls
/WINNT/Temporary Internet Files/OLK3/WINDOWS/Temporary Internet Files/OLK8304/DEZ.XLS
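
To get from the workbook-level list back to individual cells, one possible approach is to scan each formula string for the [n] marker seen in the question's output. The sketch below is not part of openpyxl: the function name external_link_indices and the regular expression are made up for illustration, and it assumes that [n] is a 1-based index into the workbook's external links, which should be checked against your own file.

```python
import re

# Assumption: an external reference shows up in the stored formula as "[n]",
# where n is a 1-based index into the workbook's list of external links.
# The lookbehind skips brackets that directly follow a name, e.g. Table1[1].
_EXTERNAL_REF = re.compile(r"(?<![\w\]])\[(\d+)\]")


def external_link_indices(value):
    """Return the external-link indices referenced by a cell value."""
    if not isinstance(value, str) or not value.startswith("="):
        return []  # not a formula, so it cannot hold a reference
    return [int(n) for n in _EXTERNAL_REF.findall(value)]


# The two formulas from the question, a table formula, and a plain number
for formula in ("=[1]Sheet1!$B$4", "=[2]Sheet1!$B$4", "=SUM(Table1[Cost])", 42):
    print(repr(formula), external_link_indices(formula))
```

With a workbook loaded as above, the same function could be applied to cell.value for every cell in a sheet; a non-empty result would mark that cell as containing an external link.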

I decided to veer outside of openpyxl in order to achieve my goal - even though openpyxl has numerous functions that refer to external links I was unable to find a simple way to achieve my goal.
Instead I decided to use ZipFile to open the workbook in memory, then search for the externalLink1.xml file. If it exists, then the workbook contains external links:
import tkinter as tk
from tkinter import filedialog
from zipfile import ZipFile
import xml.etree.ElementTree

# hide the empty Tk root window and show only the file dialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
with ZipFile(file_path) as myzip:
    try:
        # ZipFile.open raises KeyError when the archive has no such member
        my_file = myzip.open('xl/externalLinks/externalLink1.xml')
        e = xml.etree.ElementTree.parse(my_file).getroot()
        print('Has external references')
    except KeyError:
        print('No external references')
Once I have the XML file, I can proceed to identify the cell address, value and other information by running through the XML tree using ElementTree.
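
A variation on the same idea, as a rough sketch: instead of probing only for externalLink1.xml, list every externalLinkN.xml member of the archive. The helper name external_link_parts is hypothetical, and the demo builds a stand-in zip in memory rather than opening a real workbook.

```python
import io
import re
from zipfile import ZipFile

# Member names follow the pattern seen in the answer above
_LINK_PART = re.compile(r"xl/externalLinks/externalLink\d+\.xml$")


def external_link_parts(xlsx):
    """List the externalLinkN.xml members of an .xlsx (path or file object)."""
    with ZipFile(xlsx) as archive:
        return sorted(name for name in archive.namelist() if _LINK_PART.match(name))


# Stand-in archive built in memory; a real .xlsx has many more members
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/externalLinks/externalLink1.xml", "<externalLink/>")
buffer.seek(0)

parts = external_link_parts(buffer)
print("Has external references" if parts else "No external references", parts)
```

An empty list means no external link parts were found; each name returned can then be opened and parsed with ElementTree as described above.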

Related

How to read outline levels using Python `openpyxl`?

My organization has a clean export for bills of materials (BOM). I would like to automatically parse the excel file to check the BOM for certain attributes.
At the moment, I'm using Python with openpyxl.
I can read the excel workbook and worksheet just fine, but I cannot seem to find the attribute that contains the "outline level" of each row (I fully concede that I may be using the wrong terminology... another term candidate might be "group").
When I look at my excel file using excel, I see this at the left of the screen:
I would like to extract the 1 2 3 4 5 from each of the rows and to tell what grouping they were in.
My initial code is:
from pathlib import Path
import openpyxl as xl
path = Path('<path-to-my-file>.xlsx')
wb = xl.load_workbook(filename=path)
sh = wb.worksheets[0]
# ... would like to put outline level reading code here
From reading other questions, I suspect that I need to look at the row_dimension.group method of the worksheet, but I can't seem to get a handle on the syntax or the exact attribute that I'm looking for.

Thanks for the post. I was struggling with the same problem and seeing your post gave me an idea!
I overcame it with the following code:
from pathlib import Path
import openpyxl as xl
path = Path('<path-to-my-file>.xlsx')
wb = xl.load_workbook(filename=path)
sh = wb.worksheets[0]
for row in sorted(sh.row_dimensions):
    # outlineLevel and outline_level are two names for the same value
    outline1 = sh.row_dimensions[row].outlineLevel
    outline2 = sh.row_dimensions[row].outline_level
    print(row, sh.row_dimensions[row], outline1, outline2)

Maybe you can use the following code to gather each row's outline level as an integer. I use similar code, with a few more lines, to find the maximum outline level in a sheet.
for index in range(ws.min_row, ws.max_row + 1):
    row_level = ws.row_dimensions[index].outline_level + 1
Here row_level is the outline level; use it as required. Please double-check the + 1, though: if I remember correctly, you need to increase the stored value by one to get the level shown in Excel.
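
Once the levels are collected, telling which group each row belongs to is a separate step. The sketch below is one way to do it, under the assumption that a parent row always comes before its children (typical for a BOM export, but not guaranteed). The function name parent_rows and the sample data are made up for illustration.

```python
def parent_rows(levels):
    """Map each row to the nearest preceding row with a lower outline level.

    `levels` is a list of (row_number, outline_level) pairs in sheet order.
    Rows with no shallower row before them get None as parent.
    """
    parents = {}
    stack = []  # (row, level) pairs of the currently open ancestors
    for row, level in levels:
        # Close every group that is as deep as, or deeper than, this row
        while stack and stack[-1][1] >= level:
            stack.pop()
        parents[row] = stack[-1][0] if stack else None
        stack.append((row, level))
    return parents


# Hypothetical levels, e.g. gathered from ws.row_dimensions[r].outline_level
sample = [(2, 0), (3, 1), (4, 2), (5, 2), (6, 1), (7, 0)]
print(parent_rows(sample))
# {2: None, 3: 2, 4: 3, 5: 3, 6: 2, 7: None}
```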

Python xlwt produces AttributeError when searching for empty cell in Excel spreadsheet file

I have an Excel file and I am using Python to fill its rows and columns.
I want to use the following function to find the first empty row in the table and fill it:
from xlwt import Workbook, easyxf
def next_available_row(sheet):
    str_list = filter(None, sheet.col_values(1)) # error
    return str(len(str_list)+1)
wb=Workbook()
sheet=wb.add_sheet('sheet1')
sheet.write(0,0,'item')
sheet.write(0,1,'cost')
sheet.write(next_available_row(sheet),0,'potato')
sheet.write(next_available_row(sheet),1,4)
but I get the following error:
AttributeError: 'sheet' object has no attribute 'col_values'
What should I do?

The library you are using, xlwt, is for writing .xls spreadsheets only, and does not have the method col_values (to read its contents), as the error message already states (correctly).
The function next_available_row() (from How to find the first empty row of a google spread sheet using python GSPREAD?) that you want to use to search for an empty cell is based on a different library, gspread, and that is apparently not for Excel files (e.g. .xls, note there are several versions of this file type).
So you probably are looking for an entirely different library, one that reads and writes Excel files.
http://www.python-excel.org/ lists several libraries (including your xlwt):
https://pypi.python.org/pypi/xlrd
https://pypi.python.org/pypi/xlwt
https://pypi.python.org/pypi/XlsxWriter
https://pypi.python.org/pypi/openpyxl
Or maybe try to manage something by reading the file first, e.g. with xlwt's sister project, xlrd.

It seems there is no col_values method in the xlwt API. http://xlwt.readthedocs.io/en/latest/api.html
Maybe you can reach your goal by using xlrd together with it.
http://xlrd.readthedocs.io/en/latest/api.html?highlight=col_values#xlrd-sheet
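
As a rough sketch of the "first empty row" logic on its own: the function below works on a plain list of column values, such as the list that xlrd's Sheet.col_values() returns, and is not tied to any library. The name next_available_row is reused from the question; the 0-based return value is a choice made here to match the row indices that xlwt's write() expects.

```python
def next_available_row(col_values):
    """Return the 0-based index of the first row after the last non-empty cell."""
    # Walk backwards so that gaps in the middle of the column are ignored
    for index in range(len(col_values) - 1, -1, -1):
        if col_values[index] not in ("", None):
            return index + 1
    return 0  # the column is completely empty


# Hypothetical column contents: a header, one entry, one trailing blank
print(next_available_row(["item", "potato", ""]))  # 2
```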

Add data to existing excel file using python

I am trying to add data to an existing Excel file. The data is written, but the formulas and the formatting in the original file are lost.
I attached my code below
import xlwt
import xlrd
from xlutils.copy import copy
#open the excel file
rb=xlrd.open_workbook('Voltage_T.xlsx')
#make a writable copy of the opened excel file
wb=copy(rb)
#read the first sheet to write to within the writable copy
w_sheet=wb.get_sheet(0)
#write or modify the value at first row, second column (row 0, column 1)
w_sheet.write(0,1,'WWW.GOOGLE.COM')
#the last step saving the work book
wb.save('Voltage_WW.xls')

You need to set formatting_info to true
rb=xlrd.open_workbook('Voltage_T.xlsx', formatting_info = True)
However, xlrd does not currently support formatting_info for .xlsx files, so if you really have to use .xlsx you will need another library.
I haven't used it myself, so I can't vouch for it, but a quick search suggests XlsxWriter. Note that XlsxWriter only creates new files, so for editing an existing .xlsx, openpyxl may be the better fit.

Iterating over rows in a column with XLRD

I have been able to get the column to output the values of the column in a separated list. However I need to retain these values and use them one by one to perform an Amazon lookup with them. The amazon lookup is not the problem. Getting XLRD to give one value at a time has been a problem. Is there also an efficient method of setting a time in Python? The only answer I have found to the timer issue is recording the time the process started and counting from there. I would prefer just a timer. This question is somewhat two parts here is what I have done so far.
I load the spreadsheet with xlrd using argv[1] and copy it to a new spreadsheet named by argv[2]; argv[3] needs to be the timer interval, but I am not that far yet.
I have tried:
import sys
import datetime
import os
import xlrd
from xlrd.book import colname
from xlrd.book import row
import xlwt
import xlutils
import shutil
import bottlenose
AMAZON_ACCESS_KEY_ID = "######"
AMAZON_SECRET_KEY = "####"
print "Executing ISBN Amazon Lookup Script -- Please be sure to execute it python amazon.py input.xls output.xls 60(seconds between database queries)"
print "Copying original XLS spreadsheet to new spreadsheet file specified as the second arguement on the command line."
print "Loading Amazon Account information . . "
amazon = bottlenose.Amazon(AMAZON_ACCESS_KEY_ID, AMAZON_SECRET_KEY)
response = amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
shutil.copy2(sys.argv[1], sys.argv[2])
print "Opening copied spreadsheet and beginning ISBN extraction. . ."
wb = xlrd.open_workbook(sys.argv[2])
print "Beginning Amazon lookup for the first ISBN number."
for row in colname(colx=2):
    print amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
I know this is a little vague. Should I perhaps try doing something like column = colname(colx=2) then i could do for row in column: Any help or direction is greatly appreciated.

The use of colname() in your code is simply going to return the name of the column (e.g. 'C' by default in your case unless you've overridden the name). Also, the use of colname is outside the context of the contents of your workbook. I would think you would want to work with a specific sheet from the workbook you are loading, and from within that sheet you would want to reference the values of a column (2 in the case of your example), does this sound somewhat correct?
wb = xlrd.open_workbook(sys.argv[2])
sheet = wb.sheet_by_index(0)
for row in sheet.col(2):
    print amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
Looking at the call to amazon.ItemLookup(), though, you probably want to refer to row and not to "row": the latter is simply a string, while the former is the variable named row from your for loop. Since sheet.col() returns Cell objects, the ISBN itself is row.value.
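
For the timer part of the question, a minimal sketch is to pause between lookups with time.sleep. Everything here is hypothetical scaffolding: lookup_each is a made-up helper, fake_lookup stands in for the Amazon call, and the ISBN strings are placeholder values. The code uses Python 3 syntax, unlike the Python 2 prints above.

```python
import time


def lookup_each(values, lookup, delay_seconds=0.0):
    """Call lookup(value) for each value, pausing between consecutive calls."""
    results = []
    for position, value in enumerate(values):
        # No pause is needed before the very first lookup
        if position and delay_seconds > 0:
            time.sleep(delay_seconds)
        results.append(lookup(value))
    return results


def fake_lookup(isbn):
    # Stand-in for amazon.ItemLookup(ItemId=isbn, ...)
    return "looked up " + isbn


# Placeholder values; in the script these would come from sheet.col_values(2)
print(lookup_each(["isbn-one", "isbn-two"], fake_lookup, delay_seconds=0.1))
```

In the script from the question, delay_seconds could be taken from float(sys.argv[3]).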

Get name of active Excel workbook from Python

I am trying to write a Python script that will access and modify the active Excel workbook using the Excel COM interface. However, I am having difficulty getting this to work when there are multiple Excel instances running. For example, the code
import win32com.client
xl = win32com.client.Dispatch("Excel.Application")
print(xl.ActiveWorkbook.FullName)
prints out the name of the active workbook from the first running instance of Excel only. What I really want is the workbook that I last clicked on, regardless of what Excel instance it was in.
Thanks.

EDIT FOR COMMENTS
There might be a better way to do this.
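Install the excellent psutil, then collect the pids of all running Excel processes:
import psutil
excelPids = []
for proc in psutil.process_iter():
    # name is a method in current psutil versions
    if proc.name() == "EXCEL.EXE":
        excelPids.append(proc.pid)
Now enumerate the windows, but get the window title and pid. GetWindowThreadProcessId lives in win32process and returns (thread id, process id), so take the second element:
import win32gui, win32process
windowPidsAndTitle = []
win32gui.EnumWindows(lambda hwnd, resultList: resultList.append((win32process.GetWindowThreadProcessId(hwnd)[1], win32gui.GetWindowText(hwnd))), windowPidsAndTitle)
Now just find the first window whose pid is in our excelPids:
for pid, title in windowPidsAndTitle:
    if pid in excelPids:
        print(title)
        break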
END EDITS
There are a number of things to take into consideration here:
Does one instance have multiple workbooks open? In this case
xl = win32com.client.Dispatch("Excel.Application")
xl.ActiveWorkbook.FullName
Will indeed give you the last active workbook.
Or are there separate instances of EXCEL.EXE running? You can get each instance with:
xl = win32com.client.GetObject(None, "Excel.Application") #instance one
xl = win32com.client.GetObject("Name_Of_Workbook") #instance two
But this defeats the purpose because you need to know the name AND this will not tell you which one last had focus.
To @tgray's comment above, if your Excel instance is guaranteed to be the foreground window, then:
import win32gui
win32gui.GetWindowText(win32gui.GetForegroundWindow())
#parse this and use GetObject to get your excel instance
But in the worst-case scenario, with multiple instances where you have to find which had focus last, you'll have to enumerate all the windows and find the one you care about:
windows = []
win32gui.EnumWindows(lambda hwnd, resultList: resultList.append(win32gui.GetWindowText(hwnd)),windows)
#enumerates all the windows open from the top down
[i for i in windows if "Microsoft Excel" in i].pop(0)
#this one is closest to the top
Good luck with this one!
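
The "parse this" step in the foreground-window approach could look something like the sketch below. The title formats are assumptions: Excel's window caption differs between versions, so the prefix and suffixes listed here should be checked against what GetWindowText actually returns on your machine. The function name workbook_name_from_title is made up for illustration.

```python
# Assumed caption layouts; verify them against your Excel version
_PREFIX = "Microsoft Excel - "
_SUFFIXES = (" - Microsoft Excel", " - Excel")


def workbook_name_from_title(title):
    """Extract the workbook name from an Excel window title, or None."""
    if title.startswith(_PREFIX):
        return title[len(_PREFIX):]
    for suffix in _SUFFIXES:
        if title.endswith(suffix):
            return title[:-len(suffix)]
    return None  # does not look like an Excel window


for title in ("Microsoft Excel - Book1.xls", "Budget.xlsx - Excel", "Notepad"):
    print(repr(title), "->", workbook_name_from_title(title))
```

The extracted name could then be passed to win32com.client.GetObject as in the snippet above, keeping in mind that GetObject may need the full path rather than just the file name.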

With xlwings you can simply do:
import xlwings as xw
print(xw.books.active.name)
This will correctly work even if you have multiple instances of Excel open.
