python - convert docx to HTML including Fonts and Fonts Size

I'm trying to convert a file from docx to HTML in Python while keeping the font family, font size and colors. I tried a couple of solutions, i.e. python-docx, docx2html and Mammoth, but none of these packages works for me. They do convert to HTML, but many style-related things, i.e. fonts, sizes and colors, are skipped.
I tried to open and read the docx file using Python's zipfile and got the XML of the Word file. All the docx information is in that XML, so now I'm thinking of parsing the XML to HTML in Python. Maybe I can find a parser for this purpose.
Here's the snippet of code that I tried with python-docx, but I'm getting None values here.
from docx import Document

d = Document('1.docx')
d_styles = d.styles
for key in d_styles:
    print(f'{key} : {d_styles[key]}')
For the XML, using zipfile, here's my code snippet.
import zipfile

# path is the location of the .docx file
docx = zipfile.ZipFile(path)
content = docx.read('word/document.xml').decode('utf-8')
Any help will be highly appreciated.
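
The XML-parsing idea from the question could be sketched roughly as follows, using only the standard library. This is a minimal sketch under stated assumptions, not a complete converter: the function names run_style and docx_to_html are hypothetical, and it only reads formatting set directly on each run (w:rPr). Fonts, sizes and colors inherited from styles.xml are not resolved here, which is probably also why libraries return None for those values.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
W_VAL = "{" + W + "}val"
W_ASCII = "{" + W + "}ascii"


def run_style(run):
    """Collect CSS declarations from the direct formatting (w:rPr) of one run."""
    css = []
    rpr = run.find("w:rPr", NS)
    if rpr is None:
        return css
    fonts = rpr.find("w:rFonts", NS)
    if fonts is not None and fonts.get(W_ASCII):
        css.append("font-family: '%s'" % fonts.get(W_ASCII))
    size = rpr.find("w:sz", NS)
    if size is not None and size.get(W_VAL):
        # w:sz is stored in half-points, so 28 means 14pt
        css.append("font-size: %gpt" % (int(size.get(W_VAL)) / 2))
    color = rpr.find("w:color", NS)
    if color is not None and color.get(W_VAL) not in (None, "auto"):
        css.append("color: #%s" % color.get(W_VAL))
    return css


def docx_to_html(source):
    """Return simple HTML for the paragraphs of a .docx path or file object."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{" + W + "}p"):
        spans = []
        for run in p.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            style = html.escape("; ".join(run_style(run)))
            spans.append('<span style="%s">%s</span>' % (style, html.escape(text)))
        paragraphs.append("<p>%s</p>" % "".join(spans))
    return "\n".join(paragraphs)
```

Calling docx_to_html('1.docx') would return one <p> per paragraph with one <span> per run. Tables, images, and bold/italic are left out to keep the sketch short.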

Related

I can't insert tab stops into a docx generated from HTML

I have a very specific use case where I need to insert tab stops into Word documents. My code works perfectly when using a docx that was created normally. However, the other part of my use case is that I extract the HTML from a text editor and turn it into a docx. For documents generated from HTML, running the same code to insert tab stops does not work: the tab stop configuration gets created, but it is not applied to the document. I cannot seem to find a way around it, and any help would be deeply appreciated.
Below is a code sample:
from docx import Document
from docx.shared import Inches
from docx.enum.text import WD_TAB_ALIGNMENT, WD_TAB_LEADER
from htmldocx import HtmlToDocx
new_parser = HtmlToDocx()
new_html = """<p><span>some text</span></p>
<br>
<p><span>Some persons name</span></p>
<p><span>Another text</span></p>
<p><span>Some date</span></p>"""
document = Document()
new_parser.add_html_to_document(new_html, document)
for para in document.paragraphs:
    tab_stops = para.paragraph_format.tab_stops
    tab_stops.add_tab_stop(Inches(5.51),
                           WD_TAB_ALIGNMENT.RIGHT, WD_TAB_LEADER.DOTS)
document.save('new-file-name.docx')
When running this code, the tab stop configuration gets created correctly in the docx, but it is not reflected in the document itself: the tab stops are not visible.
This function is supposed to run on Azure Functions, so pywin32 is not an option for converting HTML to docx, as it does not run on Linux.
I have tried manually setting the styles of the document. I have also tried the ConvertAPI API and the aspose.words library, but nothing seems to work. It seems that there is something about converting HTML to docx that precludes inserting tab stops.
Thank you very much in advance.

How to convert Web PDF to Text

I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf, and many more, into text without saving them on my PC. Thousands of such announcements come up daily, hence I want to convert them to text without saving them locally. Are there any Python solutions for this?
Thanks

There are different ways to do this. The simplest is to download the PDF locally and then use one of the following Python modules to extract the text (pdfplumber and pdftotext read the embedded text, tesseract performs OCR):
pdfplumber
tesseract
pdftotext
...
Here is a simple code example for that (using pdfplumber)
from urllib.request import urlopen
import pdfplumber

url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'
response = urlopen(url)
# Save the download to a local file (overwritten on every run)
with open("img.pdf", 'wb') as file:
    file.write(response.read())

try:
    pdf = pdfplumber.open('img.pdf')
except Exception:
    # Some files are not PDFs (annexes we don't want), or the PDF is damaged
    print('Error. Are you sure this is a PDF?')
else:
    # pdfplumber text extraction (first page only)
    page = pdf.pages[0]
    text = page.extract_text()
EDIT: My bad, I just realised you asked "without saving it to my PC".
That being said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so they keep replacing each other and I end up with only one PDF file. I don't have a solution for extracting the text without saving the file. Sorry for that :'(
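
One possible way to avoid the temporary file is to keep the download in memory and hand the buffer to pdfplumber, which accepts file-like objects as well as paths. The following is only a sketch: the helper names download_to_memory and extract_text are hypothetical, the opener parameter exists purely so the download step can be swapped out, and some servers may reject requests that lack browser-like headers.

```python
import io
from urllib.request import urlopen


def download_to_memory(url, opener=urlopen):
    """Fetch url and return its bytes wrapped in a seekable in-memory file."""
    with opener(url) as response:
        return io.BytesIO(response.read())


def extract_text(pdf_file):
    """Extract the text of every page from a file-like PDF using pdfplumber."""
    import pdfplumber  # imported here so the download helper works without it

    with pdfplumber.open(pdf_file) as pdf:
        # extract_text() can come back empty for pages without embedded text
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

Usage would be text = extract_text(download_to_memory(url)). Nothing is written to disk, so each of the thousands of daily files only lives in memory while it is being processed.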

python - read pdf ignoring header and footer

I have a PDF file that I am reading with PyMuPDF, using the syntax below.
import fitz  # this is pymupdf

with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        text += page.get_text()
Is there a way to ignore the header and footer while reading it?
I tried converting pdf to docx as it is easier to remove headers, but the pdf file I am working on is getting reformatted when I convert it to docx.
Is there any way pymupdf does this during the read?

The documentation has a page dedicated to this problem:
Define a rectangle that omits the header.
Use the page.get_textbox(rect) method.
Source: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction#2-pageget_textboxrect
The generic solution, which works for most PDF libraries, is to:
check the size of the header/footer section in your PDF files
loop over each piece of text in the document and check its vertical position
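
The generic approach above might look like the sketch below. It assumes the block tuples returned by PyMuPDF's page.get_text("blocks"), i.e. (x0, y0, x1, y1, text, block_no, block_type) with y measured from the top of the page. The function names and the 50-point header and footer heights are placeholders that you would need to measure for your own files.

```python
def body_text(blocks, page_height, header_height=50, footer_height=50):
    """Join the text blocks that lie fully between the header and footer bands."""
    top = header_height
    bottom = page_height - footer_height
    return "".join(
        block[4]
        for block in blocks
        # block[6] == 0 marks a text block; block[1] and block[3] are y0 and y1
        if block[6] == 0 and block[1] >= top and block[3] <= bottom
    )


def read_pdf_body(path, header_height=50, footer_height=50):
    """Read a PDF with PyMuPDF, skipping header and footer bands on each page."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "".join(
            body_text(page.get_text("blocks"), page.rect.height,
                      header_height, footer_height)
            for page in doc
        )
```

Keeping the filtering in body_text separate from the file handling makes it easy to check the thresholds against a few sample pages before running it over a whole document.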

Converting docx table into html (keeping all formatting) or an image to use in html

I've used python-docx to create some tables using a specified style format in my docx file. I now need to use these tables with this same formatting. Is there a way I can either convert the table including all of the formatting and styles, colours etc. to html? Or failing that a simple (automated) way of making the table into a figure which could be used?

To convert docx to HTML, use the code below.
Note that it does not identify the tables and images from the docx: it converts docx to HTML but does not preserve tables and images.
import mammoth

# Read the docx and convert it to an HTML string
with open("docx_file.docx", 'rb') as docx_file:
    document = mammoth.convert_to_html(docx_file)

# Write the generated HTML to disk
with open('html_filename.html', 'wb') as html_file:
    html_file.write(document.value.encode('utf8'))
To keep formatting and images, use the win32 package (pywin32) to convert docx to HTML.
import win32com.client

# Requires Windows with Microsoft Word installed
doc = win32com.client.GetObject("docx_InputFile.docx")
doc.SaveAs(FileName="Html_FileName.html", FileFormat=8)  # 8 = wdFormatHTML
doc.Close()

I can't find a suitable solution that supports conversion with formatting and styles. But you may try to convert docx to JPG by using this: DOCX to JPG API. The Python library and snippets for this service are here: ConvertAPI/convertapi-python

How to write a hyperlink to a local picture into a cell in openpyxl?

I use Python 2.7.3.
I need to write a hyperlink to a local picture into a cell using the openpyxl library.
When I need to add a hyperlink to a web site, I write something like this:
from openpyxl import Workbook
wb = Workbook()
dest_filename = r'empty_book.xlsx'
ws = wb.worksheets[0]
ws.title = 'Name'
# hyperlink to web site
ws.cell('B1').hyperlink = ('http://pythonhosted.org/openpyxl/api.html')
# hyperlink to local picture
ws.cell('B2').hyperlink = ('1.png') # It doesn't work!
wb.save(filename = dest_filename)
I have 3 questions:
1. How can we write a hyperlink in the style of VBA's function, with both the hyperlink and its display name?
ActiveCell.FormulaR1C1 = _
"=HYPERLINK(""http://stackoverflow.com/questions/ask"",""site"")"
2. How can we write a hyperlink to a local image?
ws.cell('B2').hyperlink = ('1.png') # It doesn't work! And I don't know what to do.
Please help me.
3. Can we use Unicode hyperlinks to an image? For example, when I use
ws.cell('B1').hyperlink = (u'http://pythonhosted.org/openpyxl/api.html')
it fails with an error!
For example, we have a picture 'russian_language_name.png' and we create a hyperlink in Excel without any problem. We click on the cell and then type
=Hyperlink("http://stackoverflow.com/questions/ask";"site_by_russian_language")
save the document and unzip it. Then we go into its directory, to xl->worksheets->sheet1.xml, and we see the header
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
and then ...
<row r="2" x14ac:dyDescent="0.25" spans="2:6"><c r="B2" t="str" s="1"><f>HYPERLINK("http://stackoverflow.com/questions/ask","site_by_russian_language")</f><v>site_by_russian_language</v></c>
Everything is OK. Excel supports Unicode, but what about Python's openpyxl library? Does it support Unicode in hyperlinks?

As the files in the .xlsx file are XML files with UTF-8 encoding, Unicode hyperlinks are not a problem.

About question 2: I think you need to include the full path of the file in the link.
If you cannot open the file link from your Excel file, it is Excel's security policy that prohibits such actions.

I answered a similar question. Hope this helps.
Well, I could arrive at this. While there is no direct way to build a hyperlink, in your case we could do it this way. I was able to build a hyperlink to an existing file using the code below.
import openpyxl

wb = openpyxl.Workbook()
s = wb['Sheet']
s['B4'].value = '=HYPERLINK("C:\\Users\\Manoj.Waghmare\\Desktop\\script.txt", "newfile")'
s['B4'].style = 'Hyperlink'
wb.save('trial.xlsx')
Setting the style attribute to 'Hyperlink' is the key. The rest of my code may not be of much importance to you. The style attribute would otherwise have a value of 'Normal'. The strange thing is that even without the style attribute the hyperlink was working, it was just lacking the hyperlink styling. Though strange, I have seen stranger things. Hope this helps.
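
Combining the answers above (a HYPERLINK formula, the full path of the file, and the 'Hyperlink' style), a link to a local picture could be sketched as follows. The helper names hyperlink_formula and write_image_link are hypothetical, and the sketch assumes a recent openpyxl on Python 3, where all strings are Unicode.

```python
from pathlib import Path


def hyperlink_formula(target, label):
    """Build an Excel HYPERLINK formula; embedded double quotes are doubled."""
    def esc(value):
        return str(value).replace('"', '""')

    return '=HYPERLINK("%s", "%s")' % (esc(target), esc(label))


def write_image_link(xlsx_path, cell, image_path, label):
    """Write a styled link to a local image into one cell of a new workbook."""
    import openpyxl  # imported here so hyperlink_formula works without it

    wb = openpyxl.Workbook()
    ws = wb.active
    # Use the absolute path so Excel can resolve the file from any location
    ws[cell] = hyperlink_formula(Path(image_path).resolve(), label)
    ws[cell].style = "Hyperlink"
    wb.save(xlsx_path)
```

For example, write_image_link('empty_book.xlsx', 'B2', '1.png', 'picture') would put a clickable "picture" label in B2. Whether Excel actually opens the file still depends on its security settings, as noted above.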
