I'm pretty much stuck right now.
I wrote a parser in Python 3 using the python-docx library to extract all tables found in an existing .docx file and store them in a Python data structure.
So far so good. Works as it should.
Now I have the problem that there are hyperlinks in these tables which I definitely need! Due to the underlying XML structure, the python-docx library doesn't catch these: neither the URL nor the display text is provided. I found many people with similar concerns about this, but most didn't seem to have 'just that' dilemma.
I thought about unpacking the .docx, scanning the word/_rels/document.xml.rels part for the corresponding rId, and filling the actual data I have with the links found in that XML.
Either way, it seems seriously tedious to do it that way, so I was wondering: is there a more Pythonic way to do it, or does somebody have good advice on how to tackle this problem?
You can extract the links by parsing the XML of the .docx file.
You can walk every element in the document with document.element.getiterator() (or document.element.iter(), since getiterator() is deprecated in newer lxml versions).
Iterate over all the XML tags and extract their text; that will give you the data which python-docx failed to extract.
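A minimal sketch of that approach, assuming the standard WordprocessingML namespace URIs and python-docx's underlying lxml tree (the filename is hypothetical):

from docx import Document

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"

doc = Document("tables.docx")  # hypothetical filename

# w:hyperlink elements skipped by the table API are still present
# in the underlying lxml tree
for link in doc.element.iter(W + "hyperlink"):
    rid = link.get(R + "id")
    # display text = concatenation of the w:t runs inside the hyperlink
    text = "".join(t.text or "" for t in link.iter(W + "t"))
    if rid is not None:
        # the URL itself lives in word/_rels/document.xml.rels,
        # exposed by python-docx as the part's relationships
        url = doc.part.rels[rid].target_ref
        print(text, "->", url)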
I need to deal with tables in many Word files. Some of them are created in Word's table format, which can be read using python-docx.
However, some of them were inserted from Excel, and I don't know why python-docx cannot read those. Here is a piece of code I wrote as a test. As you can see in the terminal output, there is nothing in the list variable tables.
from docx import Document

docFile = 'a.docx'
document = Document(docFile)
tables = document.tables  # the tables python-docx recognizes
print(tables)             # prints [] here
Can anyone help? Thanks a lot!
I'm fighting the same issue using Pages on OSX to create a .docx template. I've found that Format > Arrange > Object Placement needs to be set to Move with Text for the table; changing it to have any other alignment or formatting causes the tables to disappear in python-docx and be read as paragraphs that contain nothing. Looking at the XML of both versions and the python-docx code, I'm suspicious of w:tblInd, but I'm not clued up enough to go much further. I see recent GitHub issues covering this, so hopefully it will get sorted.
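If it helps to debug, here is a quick diagnostic sketch to check whether tables are still present in the raw XML even when document.tables comes back empty (filename taken from the question above):

from docx import Document

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

doc = Document("a.docx")

# document.tables only surfaces tables python-docx recognizes;
# counting raw w:tbl elements shows whether tables exist in the XML at all
raw_tables = list(doc.element.body.iter(W + "tbl"))
print(len(doc.tables), "tables via the API,", len(raw_tables), "w:tbl elements in the XML")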
I am writing a .doc and .docx parser. I need to obtain various metadata about documents in these formats. For example, for .docx I need to get at the XML code and work with the tags directly. Can you suggest solutions that will help solve my problem? Solutions like python-docx are not suitable, because they work only with the text.
If you need the raw docx data, you'll probably have to work with it at a low level, i.e. open the file with zipfile and read the metadata with xml.etree.
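A minimal sketch of that low-level approach (the filename is hypothetical; the part names are fixed by the OOXML package layout):

import zipfile
import xml.etree.ElementTree as ET

# a .docx file is just a zip archive of XML parts
with zipfile.ZipFile("report.docx") as z:   # hypothetical filename
    body_xml = z.read("word/document.xml")  # the document body
    core_xml = z.read("docProps/core.xml")  # metadata: title, author, dates, ...

root = ET.fromstring(body_xml)
print(root.tag)  # the w:document element, ready for tag-level work

meta = ET.fromstring(core_xml)
for child in meta:
    print(child.tag, child.text)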
I am trying to extract "only" text information from 10-K reports (e.g. a company's proxy reports) on SEC's EDGAR system using Python's BeautifulSoup or HTMLParser. However, the parsers I am using do not seem to work well on the 'txt'-format files, which include a large portion of meaningless signs and tags along with some XBRL information that is not needed at all. When I apply the parser directly to the 'htm'-format files, which are more or less free of meaningless tags, it seems to work relatively fine.
"""for Python 3, from urllib.request import urlopen"""
from urllib2 import urlopen
from bs4 import BeautifulSoup
"""for extracting text data only from txt format"""
txt = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/0001660156-16-000019.txt")
bs_txt = BeautifulSoup(txt.read())
bs_txt_text = bs_txt.get_text()
len(bs_txt_text) # 400051
"""for extracting text data only from htm format"""
html = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/f201510kzec2_10k.htm")
bs_html = BeautifulSoup(html.read())
bs_html_text = bs_html.get_text()
len(bs_html_text) # 98042
But the issue is that I am in a position where I have to rely on the 'txt'-format files, not the 'htm' ones. So my question is: is there any way to remove all the meaningless signs and tags from the files and extract only the text information, as directly extracted from the 'htm' files? I am relatively new to parsing with Python, so if you have any idea on this, it would be of great help. Thank you in advance!
The best way to deal with XBRL data is to use an XBRL processor such as the open-source Arelle (note: I have no affiliation with them) or other proprietary engines.
You can then look at the data with a higher level of abstraction. In terms of the XBRL data model, the process you describe in the question involves:
looking for concepts that are text blocks (textBlockItemType) in the taxonomy;
retrieving the value of the facts reported against these concepts in the instance;
additionally, obtaining some meta-information regarding it: who (reporting entity), when (XBRL period), what the text is about (concept metadata and documentation), etc.
An XBRL processor will save you the effort of resolving the entire DTS, as well as dealing with the complexity of the low-level syntax.
The second most appropriate way is to use an XML parser, maybe with an XML Schema engine as well as XQuery or XSLT, but this will require more work as you will need to either:
look at the XML Schema (XBRL taxonomy schema) files, recursively navigating them and looking for text block concepts, deal with namespaces, links, and so on (which an XBRL processor shields you from)
or only look at the instance, ideally the XML file (e.g., https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml ), with a few hacks (such as taking XML elements whose names end with TextBlock), but this is at your own risk and not recommended, as it bypasses the taxonomy (a rough sketch follows below).
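A rough sketch of that last hack, with the caveats above (it uses the instance URL from the answer and only the standard library):

import xml.etree.ElementTree as ET
from urllib.request import urlopen

# NOT recommended for production: grabs facts whose element name ends
# with "TextBlock" straight from the instance, bypassing the taxonomy
url = "https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml"
root = ET.fromstring(urlopen(url).read())

for elem in root.iter():
    tag = elem.tag.split("}")[-1]  # strip the namespace prefix
    if tag.endswith("TextBlock") and elem.text:
        print(tag, ":", elem.text[:80], "...")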
Finally, as you suggest in the original question, you can also look at the document-format files (HTML, etc.) rather than at the data files of the SEC filing. However, this defeats the purpose of using XBRL, which is to make the data understandable by a computer thanks to tags and contexts, and it may miss important context information associated with the text -- a bit like opening a spreadsheet file with a text/hex editor.
Of course, there are use cases that could justify using that last approach, such as running natural language processing algorithms. All I am saying is that this is then outside the scope of XBRL.
There is an HTML tag stripper on the pyparsing wiki Examples page. It does not try to build an HTML doc; it merely looks for HTML and script tags and strips them out.
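A condensed sketch of that approach (the variable names are mine; pyparsing's makeHTMLTags and transformString do the heavy lifting):

import pyparsing as pp

# strip <script>...</script> blocks wholesale, then any remaining tags
script_open, script_close = pp.makeHTMLTags("script")
script_block = script_open + pp.SkipTo(script_close) + script_close
stripper = (script_block | pp.anyOpenTag | pp.anyCloseTag).suppress()

html = "<html><script>var x = 1;</script><body><p>Hello, <b>world</b>!</p></body></html>"
print(stripper.transformString(html))  # Hello, world!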
I am working on a project where I have a PDF file which describes a health policy. What I need to do is extract the information from this PDF and save it in some form such that I can answer questions related to the policy by extracting info from it.
This PDF is very big, so I want to divide it according to its different sections, so that when a query related to some particular area comes in, I won't have to go through the entire document.
I tried solving this using some PDF converters which convert PDFs into HTML. But these converters won't convert the PDF to HTML properly, so the headings don't get heading tags. Also, even if I convert it properly and get the proper sections out of the document, I don't know how to store this data (I mean, in which form should I store it?).
Is there any other solution with which I can achieve this? I am using Python, and I can also use NLTK if needed. Also, the format of the PDFs is not fixed; I mean to say my code should work on any kind of PDF.
PDFMiner is great in that it has the location of every bit of text it gets from the PDF. It won't be nicely put into header tags or anything like that, but if you have a consistent PDF structure in your docs, you might be able to get something working.
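A small sketch with pdfminer.six's layout API, using font size as a crude heading heuristic (the size threshold and filename are assumptions, not part of the original answer):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# walk the layout tree; treat unusually large text as a section heading
for page_layout in extract_pages("policy.pdf"):  # hypothetical filename
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            sizes = [c.size for line in element for c in line if isinstance(c, LTChar)]
            text = element.get_text().strip()
            if sizes and max(sizes) > 14 and text:  # crude heading threshold
                print("HEADING:", text)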
Is it possible to do an in-place edit of an XML document using XPath?
I'd prefer a Python solution, but Java would be fine too.
XPath is not intended for editing a document in place, as far as I know. It is intended only to select nodes of the document. XSLT relies on XPath and can transform documents.
Regarding Python, see the answer to this question: how to use xpath in python. It also mentions libraries which can do XSLT transformations.
Using XML to store data is probably not optimal, as you are experiencing here. Editing XML is extremely costly.
One way of doing the editing is parsing the XML into a tree, then inserting stuff into that tree, and then rebuilding the XML file.
Editing an XML file in place is also possible, but then you need some kind of search mechanism that finds the location you need to edit or insert at, and then you write to the file from that point. Remember to also read the remaining data first, because it will be overwritten. This is fine for inserting new tags or data, but editing existing data makes it even more complicated.
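For the first approach, lxml gives you XPath selection plus write-back in a few lines (a minimal sketch; the filename and XPath expression are hypothetical):

from lxml import etree

# parse, select via XPath, modify, write back (not literally "in place",
# but the closest practical equivalent)
tree = etree.parse("data.xml")                       # hypothetical file
for node in tree.xpath("//price[@currency='USD']"):  # hypothetical XPath
    node.text = "9.99"
tree.write("data.xml", xml_declaration=True, encoding="utf-8")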
My own rule is not to use XML for storage, but to present data. So the storage facility, or some kind of middleman, needs to form XML files from the data it has.