I have a test PDF file with just a 3x3 table that is properly marked up with table headings and the like. What I want to do is extract the structure of the table. Like so:
left | center | right
One  | Two    | Three
If that table were in the PDF, I want to be able to know programmatically that the table has three headers ("left", "center", "right") and one row of data ("One", "Two", "Three").
I am using fitz (PyMuPDF) and when I use this code:

import fitz

doc = fitz.open("test.pdf")
for page in doc:
    tp = page.get_textpage()   # TextPage for this page
    html = tp.extractHTML()    # HTML format
    print(html)

It seems to strip out all the actual table markup and replace it with just paragraph and div tags. What am I doing wrong?
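Recent PyMuPDF releases (1.23+, an assumption about your install) can detect tables directly with page.find_tables(), which avoids the HTML round-trip entirely. A minimal sketch; the helper below only post-processes the plain row lists that Table.extract() returns, so the runnable part needs no PDF:

```python
# With PyMuPDF >= 1.23 (assumption about your version) tables can be
# detected directly, skipping the HTML round-trip:
#
#   import fitz
#   doc = fitz.open("test.pdf")
#   for page in doc:
#       for tab in page.find_tables().tables:
#           headers, data = split_header(tab.extract())
#
# Table.extract() returns the table as a list of rows, each a list of
# cell strings, so the helper is demonstrated with plain lists.

def split_header(rows):
    """Split extracted rows into (header_row, data_rows).
    Assumes the first row holds the column headings."""
    if not rows:
        return [], []
    return rows[0], rows[1:]

headers, data = split_header([['left', 'center', 'right'],
                              ['One', 'Two', 'Three']])
print(headers)    # ['left', 'center', 'right']
print(len(data))  # 1
```

With the example table this reports three headers and one data row, which is exactly the information asked for above.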
I need to get the full text of a document as a Python string. So I use the docx library:

import docx

doc = docx.Document(user_file)
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
text = '\n'.join(fullText)

It works, but it ignores text in tables. How should I get the data from tables? Maybe there is a way to clear the tags or somehow pre-process the document? Thanks in advance!
doc.tables returns a list of Table instances corresponding to the tables in the document, in document order. Note that only tables appearing at the top level of the document appear in this list; a table nested inside a table cell does not appear. A table within revision marks such as <w:ins> or <w:del> will also not appear in the list.
I am parsing a .html file using BeautifulSoup4 doing the following:
data = [item.text.strip() for item in soup.find_all('span')]
The code takes all the items in a given table and stores them in data. I noticed that some of the elements in data contain what looks like HTML entity encoding. An example element:
data[5] stores 'CSCI-GA.1144-\u200b001'
The text I expected was just 'CSCI-GA.1144-001'.
In the HTML file, I find it as 'CSCI-GA.1144-001'.
Why does it show up differently when I parse than when I inspect the HTML code? And how do I parse the data so these characters are left out? Is there a way to exclude them?
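'\u200b' is ZERO WIDTH SPACE, a real Unicode character that is present in the HTML source but invisible in the browser; it is not entity encoding, so there is nothing for the parser to decode. You can strip it (and similar invisible characters) after extraction; a minimal sketch:

```python
# Zero-width characters that commonly hide in scraped text:
# U+200B zero width space, U+200C/U+200D joiners, U+FEFF BOM
INVISIBLE = ['\u200b', '\u200c', '\u200d', '\ufeff']

def clean(s):
    for ch in INVISIBLE:
        s = s.replace(ch, '')
    return s.strip()

print(clean('CSCI-GA.1144-\u200b001'))  # CSCI-GA.1144-001
```

Applied to the scrape above, that would be data = [clean(item.text) for item in soup.find_all('span')].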
Good day SO,
I have a task where I need to extract specific parts of a document template (for automation purposes). While I am able to traverse the document and know my current position during traversal (via checking for regexes, keywords, etc.), I am unable to extract:
The structure of the document
Detect Images that are in-between text
Am I able to obtain, for example, an array of the structure of the document below?
['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']
My current implementation is shown below:

from docx import Document

document = Document('demo.docx')
text = []
for x in document.paragraphs:
    if x.text != '':
        text.append(x.text)
Using the code above, I am able to obtain all the Text data from the document, but I am unable to detect the type of text (Header or Normal), and I am unable to detect any Images. I am currently using python-docx.
My main problem is to obtain the position of the image within the document (i.e. between paragraphs) so that I can re-create another document, using text and images extracted. This task requires me to know where the image appears in the document, and where to insert the image in the new document.
Any help is greatly appreciated, thank you :)
For extracting the structure of the paragraphs and headings you can use the built-in objects in python-docx. Check this code:

from docx import Document

document = Document('demo.docx')
text = []
style = []
for x in document.paragraphs:
    if x.text != '':
        style.append(x.style.name)
        text.append(x.text)
With x.style.name you can get the style of each piece of text in your document (e.g. 'Heading 1' or 'Normal').
You can't get information about images through python-docx's high-level API. For that, you need to look at the underlying XML. Check the XML output with:

for elem in document.element.iter():
    print(elem.tag)
Let me know if you need anything else.
For extracting each image's name and its location, use this:

W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC_NS = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'

tags = []
text = []
for t in document.element.iter():
    if t.tag in [W_NS + 'r', W_NS + 't', PIC_NS + 'cNvPr', W_NS + 'drawing']:
        if t.tag == PIC_NS + 'cNvPr':
            print('Picture found:', t.attrib['name'])
            tags.append('Picture')
            text.append(t.attrib['name'])
        elif t.text:
            tags.append('text')
            text.append(t.text)
You can check previous and next text from text list and their tag from the tag list.
I'm new to Python and I'm trying to extract data from an HTML page. A certain column of the table is a mixture of text and URLs. I'd like to extract all the information from that column, keeping the links intact, to a CSV file (which I'll later save as an Excel file). Please advise me. Here's my code to extract just the text:
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    try:
        RS_id = str(tds[5].get_text().encode('utf-8'))
    except IndexError:
        continue  # skip rows without enough cells
A few cells of the column have multiple URLs and I'd like to keep them the same.
How is the data in that column written? If there is a clear pattern for how the URLs are separated from the other text, then you can use the str.split('character') method.
Say the column of data you care about has all of the entries split apart by a ',' character, then you would say:
column_data=RS_id.split(',')
This gives you a list of everything in that column, split at every comma. Then you just index the list to get the URL you're after. If there is no consistent position to index by, you may have to do something like:

URL_list = []
for item in column_data:
    if 'http' in item:
        URL_list.append(item)
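For example, with a made-up cell value where labels and links are mixed and comma-separated, the split-and-filter above gives:

```python
# Hypothetical cell text; the real column contents may differ
RS_id = 'notes, http://example.com/a, more notes, http://example.com/b'
column_data = RS_id.split(',')

URL_list = []
for item in column_data:
    if 'http' in item:
        URL_list.append(item.strip())

print(URL_list)  # ['http://example.com/a', 'http://example.com/b']
```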
EDIT:
check out how beautifulsoup parses the table: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
Each link's URL is in the <a> tag's href attribute (a['href'] in BeautifulSoup), which is where the hyperlink points.
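A sketch with BeautifulSoup that reads the href attributes directly instead of relying on the cell text; the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''<table><tr>
  <td>row label</td>
  <td>see <a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a></td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    text = tds[1].get_text(' ', strip=True)          # visible cell text
    links = [a['href'] for a in tds[1].find_all('a')]  # the URLs themselves
    print(text, links)
```

This keeps the text and its URLs paired per cell, which is convenient when writing the rows out to CSV.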
I have an HTML file (encoded in UTF-8). I open it with codecs.open(). The file's structure is:
<html>
// header
<body>
// some text
<table>
// some rows with cells here
// some cells contains tables
</table>
// maybe some text here
<table>
// a form and other stuff
</table>
// probably some more text
</body></html>
I need to retrieve only the first table (and discard the one with the form). Omit all input before the first <table> and after the corresponding </table>. Some cells also contain paragraphs, bold text and scripts. There is no more than one nested table per row of the main table.
How can I extract it to get a list of rows, where each element holds the plain (unicode string) cell data, and a list of rows for each nested table? There's no more than one level of nesting.
I tried HTMLParser, pyparsing and the re module, but couldn't get this working.
I'm quite new to Python.
Try Beautiful Soup.
In principle you need to use a real parser (which Beautiful Soup is); regex cannot deal with nested elements, for computer-science reasons (finite state machines can't parse context-free grammars, IIRC).
You may like lxml. I'm not sure I fully understood what you want to do with that structure, but maybe this example will help:

import lxml.html

def process_row(row):
    for cell in row.xpath('./td'):
        inner_tables = cell.xpath('./table')
        if len(inner_tables) < 1:
            yield cell.text_content()
        else:
            yield [process_table(t) for t in inner_tables]

def process_table(table):
    return [process_row(row) for row in table.xpath('./tr')]

html = lxml.html.parse('test.html')
first_table = html.xpath('//body/table[1]')[0]
data = process_table(first_table)
If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.
The XPath for pulling out the first table in the document would be "(//table)[1]" (note the parentheses: "//table[1]" instead matches every table that is the first table child of its parent).
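A self-contained sketch of this approach with lxml (the HTML here is made up to mirror the structure described above):

```python
import lxml.html

doc = lxml.html.fromstring('''<body>
  <p>intro</p>
  <table><tr><td>a</td><td>b</td></tr></table>
  <table><tr><td>form stuff</td></tr></table>
</body>''')

# (//table)[1] = first table document-wide; the form table is ignored
first = doc.xpath('(//table)[1]')[0]
rows = [[td.text_content() for td in tr.xpath('./td')]
        for tr in first.xpath('.//tr')]
print(rows)  # [['a', 'b']]
```

In a real file you would use lxml.html.parse('file.html') instead of fromstring, then apply the same XPath to the resulting tree.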