python - read pdf ignoring header and footer

I have a PDF file that I am reading with PyMuPDF, using the code below.
import fitz  # this is pymupdf
with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        text += page.getText()  # named get_text() in newer PyMuPDF releases
Is there a way to ignore the header and footer while reading it?
I tried converting the PDF to DOCX, since headers are easier to remove there, but the PDF I am working with gets reformatted during the conversion.
Is there any way PyMuPDF can do this during the read?

The documentation has a page dedicated to this problem:
1. Define a rectangle that omits the header (and footer).
2. Pass it to the page.get_textbox(rect) method.
Source: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction#2-pageget_textboxrect
The generic solution, which works with most PDF libraries, is to:
1. Check the size of the header/footer section in your PDF files.
2. Loop over each piece of text in the document and check its vertical position, skipping anything that falls inside those sections.
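The generic approach can be sketched roughly as follows. This is only an illustration under stated assumptions, not PyMuPDF's own method: the function name `strip_header_footer` and the 50-point margins are hypothetical, and each text block is assumed to be a tuple beginning with `(x0, y0, x1, y1, text)` with y measured from the top of the page, which is the layout of the tuples returned by `page.get_text("blocks")` in PyMuPDF.

```python
def strip_header_footer(blocks, page_height, header_height=50, footer_height=50):
    """Keep only the text of blocks lying fully between header and footer.

    blocks: iterable of tuples beginning with (x0, y0, x1, y1, text),
    with y measured downward from the top of the page.
    The margin defaults are placeholders (an assumption); measure the
    real header/footer size in your own PDF files.
    """
    top = header_height
    bottom = page_height - footer_height
    kept = []
    for block in blocks:
        y0, y1, text = block[1], block[3], block[4]
        # A block is body text only if it starts below the header band
        # and ends above the footer band.
        if y0 >= top and y1 <= bottom:
            kept.append(text)
    return "".join(kept)


# Hypothetical blocks for a page that is 842 points tall (A4).
sample_blocks = [
    (72, 20, 500, 35, "Company Confidential\n"),  # inside the header band
    (72, 100, 500, 400, "Body paragraph.\n"),     # body text
    (72, 810, 500, 825, "Page 1 of 10\n"),        # inside the footer band
]
body = strip_header_footer(sample_blocks, page_height=842)
print(body)
```

With PyMuPDF, the inputs would come from `page.get_text("blocks")` and `page.rect.height`; calling `page.get_textbox(rect)` with a rectangle covering the same band is the clip-based equivalent.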

Related

python - convert docx to HTML including Fonts and Fonts Size

I'm trying to convert a file from DOCX to HTML in Python, keeping the font family, font sizes, and colors. I tried a couple of solutions, namely python-docx, docx2html, and Python Mammoth,
but none of these packages works for me. They do convert to HTML, but many style-related things, such as fonts, sizes, and colors, are skipped.
I also tried opening the DOCX file with Python's zipfile module and reading the XML of the Word file. All the DOCX information is in the XML, so now I'm thinking of converting that XML to HTML in Python; maybe there is a parser for this purpose.
Here's the snippet of code that I tried with python-docx, but I'm getting None values here.
from docx import Document
d = Document('1.docx')
d_styles = d.styles
for key in d_styles:
    print(f'{key} : {d_styles[key]}')
For the XML, using zipfile, here is my code snippet.
import zipfile
docx = zipfile.ZipFile(path)
content = docx.read('word/document.xml').decode('utf-8')
Any help will be highly appreciated.
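The XML-parsing idea described in the question could be sketched along these lines with the standard library. It is a rough illustration, not a complete converter: the helper names `attr` and `runs_to_html` are hypothetical, only formatting applied directly to a run (`w:rFonts`, `w:sz`, `w:color`) is read, styles inherited from `word/styles.xml` are not resolved, and runs nested inside other elements such as hyperlinks are skipped.

```python
import html
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def attr(element, name):
    """Read a w:-namespaced attribute, or None if the element is missing."""
    return None if element is None else element.get(f"{{{W}}}{name}")


def runs_to_html(document_xml):
    """Turn each paragraph into <p> and each direct child run into a <span>."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        spans = []
        for r in p.findall("w:r", NS):
            text = "".join(t.text or "" for t in r.findall("w:t", NS))
            font = attr(r.find("w:rPr/w:rFonts", NS), "ascii")
            size = attr(r.find("w:rPr/w:sz", NS), "val")      # stored in half-points
            color = attr(r.find("w:rPr/w:color", NS), "val")  # hex RGB or "auto"
            css = []
            if font:
                css.append(f"font-family:'{font}'")
            if size:
                css.append(f"font-size:{int(size) / 2:g}pt")
            if color and color != "auto":
                css.append(f"color:#{color}")
            style = ";".join(css)
            spans.append(f'<span style="{style}">{html.escape(text)}</span>')
        paragraphs.append("<p>" + "".join(spans) + "</p>")
    return "\n".join(paragraphs)


# Hand-written sample standing in for word/document.xml.
sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
    '<w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="24"/>'
    '<w:color w:val="FF0000"/></w:rPr>'
    '<w:t>Hello</w:t></w:r></w:p></w:body></w:document>'
)
result = runs_to_html(sample)
print(result)
```

For a real file, the bytes returned by `docx.read('word/document.xml')` can be passed to `runs_to_html` directly.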

How to convert Web PDF to Text

I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf (and many more) into text without saving them to my PC, because thousands of such announcements come up daily. Are there any Python solutions for this?
Thanks
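There are different ways to do this, but the simplest is to download the PDF locally and then use one of the following tools to extract the text (tesseract performs OCR and is only needed for scanned pages; pdfplumber and pdftotext read the text embedded in the PDF):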
pdfplumber
tesseract
pdftotext
...
Here is a simple code example (using pdfplumber):
from urllib.request import urlopen
import pdfplumber
url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'
# Download the PDF and write it to a local file
with urlopen(url) as response, open("img.pdf", 'wb') as file:
    file.write(response.read())
try:
    pdf = pdfplumber.open('img.pdf')
except Exception:
    # Some files are not PDFs (annexes we don't want), or the PDF is damaged
    print('Error. Are you sure this is a PDF?')
    raise  # inside a loop over many URLs, use `continue` here instead
# pdfplumber text extraction
page = pdf.pages[0]
text = page.extract_text()
EDIT: My bad, I just realised you asked for a solution "without saving it to my PC".
That being said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so each download overwrites the previous one and I end up with only one PDF file on disk. I do not have a solution for extracting text from a PDF without saving the file. Sorry about that :'(
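For what it is worth, the download does not have to touch the disk: `pdfplumber.open` accepts a file-like object as well as a path, so the bytes can be wrapped in `io.BytesIO`. The sketch below assumes that approach; the helper names `download_to_memory` and `first_page_text` and the `opener` parameter are hypothetical, and whether a given server accepts a plain `urlopen` request is not something this snippet checks.

```python
import io
from urllib.request import urlopen


def download_to_memory(url, opener=urlopen):
    """Fetch a URL and return its bytes wrapped in an in-memory file object."""
    with opener(url) as response:
        return io.BytesIO(response.read())


def first_page_text(stream):
    """Extract the text of page 1 from a PDF held in a file-like object."""
    # Imported lazily so the download helper has no third-party dependency.
    import pdfplumber

    with pdfplumber.open(stream) as pdf:
        return pdf.pages[0].extract_text()


# Typical use (needs network access and pdfplumber installed):
# buffer = download_to_memory(url)
# text = first_page_text(buffer)
```

Because nothing is written to a file, each announcement can be processed and discarded in turn.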

how can i classify the chapters of a pdf file and analyze the content per chapter?

I want to classify and analyze the chapters and subchapters of a book in PDF format: count the number of words, and examine how often each word occurs and in which chapter.
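# Install first with: pip install PyPDF2
# (PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names;
# PyPDF2 3.0 replaced them with PdfReader, reader.pages and extract_text)
import PyPDF2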
# Creating a pdf file object
pdf = open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf',"rb")
# creating pdf reader object
pdf_reader = PyPDF2.PdfFileReader(pdf)
# checking number of pages in a pdf file
print(pdf_reader.numPages)
print(pdf_reader.getDocumentInfo())
# creating a page object
page = pdf_reader.getPage(0)
# finally extracting text from the page
print(page.extractText())
# Extracting entire PDF
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    a = str(1 + pdf_reader.getPageNumber(page))
    print(a)
    page_content = page.extractText()
    print(page_content)
# closing the pdf file
pdf.close()
This code already works. Now I want to do more analysis, such as
storing each chapter in its own variable and counting the number of words.
In the end, everything should be stored in an Excel file.
I tried something similar with CVs in PDF format, and this is all I learned:
PDF is an unstructured format, so it is not possible to extract information from every PDF in a structured way. But if you know the structure of the books, you can pick out the chapter titles by a distinguishing feature, for example if they are set in bold or italic. This link can help you extract that information.
You can then traverse each chapter until you hit the next chapter title.
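The "traverse until the next chapter title" step might look like the sketch below once the text has been extracted. Note the swapped-in technique: instead of detecting bold or italic fonts, it assumes (hypothetically) that every chapter title sits on its own line and looks like "Chapter 3" or "Chapter 3: Title". The pattern and the function names are illustrative and would need adjusting to the actual book.

```python
import re
from collections import Counter

# Hypothetical heading pattern: a line such as "Chapter 3" or "Chapter 3: Title".
CHAPTER_HEADING = re.compile(r"^Chapter\s+\d+.*$", re.MULTILINE)


def split_into_chapters(full_text):
    """Return {heading: body}, where each body runs until the next heading."""
    matches = list(CHAPTER_HEADING.finditer(full_text))
    chapters = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        chapters[match.group().strip()] = full_text[match.end():end]
    return chapters


def word_counts(body):
    """Count how often each lowercase word occurs in one chapter body."""
    return Counter(re.findall(r"\w+", body.lower()))


# Made-up text standing in for the concatenated page contents.
sample_text = "Chapter 1\nThe cat sat. The end.\nChapter 2\nA dog ran.\n"
chapters = split_into_chapters(sample_text)
stats = {title: word_counts(body) for title, body in chapters.items()}
for title, counts in stats.items():
    print(title, sum(counts.values()), counts.most_common(1))
```

The per-chapter counts could then be written out row by row with the standard `csv` module, producing a file that Excel can open.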

Converting docx table into html (keeping all formatting) or an image to use in html

I've used python-docx to create some tables using a specified style format in my docx file. I now need to use these tables with this same formatting. Is there a way I can either convert the table including all of the formatting and styles, colours etc. to html? Or failing that a simple (automated) way of making the table into a figure which could be used?
To convert DOCX to HTML, use the code below.
Note that this code does not identify the tables and images in the DOCX: it converts DOCX to HTML but does not preserve tables and images.
import mammoth
# Convert the .docx to HTML and write the result as UTF-8
with open("docx_file.docx", 'rb') as docx_file:
    document = mammoth.convert_to_html(docx_file)
with open('html_filename.html', 'wb') as html_file:
    html_file.write(document.value.encode('utf8'))
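To keep formatting and images, use the win32com package (part of pywin32; requires Windows with Microsoft Word installed) to convert DOCX to HTML.
import win32com.client
# GetObject opens the document through Word's COM interface
doc = win32com.client.GetObject("docx_InputFile.docx")
# FileFormat=8 is wdFormatHTML
doc.SaveAs(FileName="Html_FileName.html", FileFormat=8)
doc.Close()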
I couldn't find a suitable solution that supports conversion with formatting and styles, but you could try converting DOCX to JPG using the DOCX to JPG API. The Python library and snippets for this service are here: ConvertAPI/convertapi-python

Add Section to OpenDocument Text file with ODFpy

I am using Python 2.7 and ODFpy to write an OpenDocument Text (ODT) file. Is there a way, using the existing ODFpy API, to add sections (a la Format->Sections...) to the document? Is there a way to import them from another document and then populate them, or to otherwise fetch the styling from another document?
A section can be added to a document with something like this:
from odf import text as odftext
from odf import opendocument
# OpenDocumentText() creates a text document whose body is exposed as document.text
document = opendocument.OpenDocumentText()
document.text.addElement(odftext.Section(name="section1"))
