I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf (and many more) into text without saving them to my PC, because thousands of such announcements come up daily. Any Python code solutions for this?
Thanks
There are different methods to do this, but the simplest is to download the PDF locally and then use one of the following Python modules to extract the text (tesseract being the OCR route, for scanned PDFs):
pdfplumber
tesseract
pdftotext
...
Here is a simple code example for that (using pdfplumber)
from urllib.request import urlopen
import pdfplumber
url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'
# Download the PDF and save it to a temporary local file
response = urlopen(url)
with open("img.pdf", 'wb') as file:
    file.write(response.read())
try:
    pdf = pdfplumber.open('img.pdf')
except Exception:
    # Some files are not PDFs (annexes we don't want), or the PDF is damaged
    print('Error. Are you sure this is a PDF?')
    raise
# pdfplumber text extraction
page = pdf.pages[0]
text = page.extract_text()
EDIT: My bad, just realised you asked "without saving it to my PC".
That being said, I also scrape a lot of PDFs (1000s as well), but I save them all as "img.pdf" so they just keep replacing each other and I end up with only one PDF file on disk. I don't have a solution for extracting the text without saving the file, sorry for that :'(
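One thing that might be worth trying, though: pdfplumber.open() also accepts a file-like object, so wrapping the downloaded bytes in io.BytesIO should skip the temporary file entirely. A minimal sketch of that idea (same URL as above, untested at scale):
from io import BytesIO
from urllib.request import urlopen
import pdfplumber
url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'
# keep the downloaded bytes in memory instead of writing them to disk
data = BytesIO(urlopen(url).read())
with pdfplumber.open(data) as pdf:
    text = pdf.pages[0].extract_text()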
I'm trying to convert a file from DOCX to HTML with font family, font size and colors in Python. I tried a couple of solutions, i.e. python-docx, docx2html and Python Mammoth, but none of these packages works for me: they do convert to HTML, but many style-related things, i.e. fonts, sizes and colors, are skipped.
I tried to open and read the docx file using Python's zipfile and get the XML of the Word file. I got all the docx information as XML, so now I'm thinking of parsing that XML to HTML in Python; maybe I can find a parser for this purpose.
Here's the snippet of code that I tried with python-docx, but I'm getting None values here.
from docx import Document
d = Document('1.docx')
d_styles = d.styles
for key in d_styles:
    print(f'{key} : {d_styles[key]}')
For the XML-via-zipfile approach, here's my code snippet.
import zipfile
docx = zipfile.ZipFile(path)
content = docx.read('word/document.xml').decode('utf-8')
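Building on that, here's a rough sketch of the direction I have in mind: walking the runs in document.xml with ElementTree and pulling out the font, size and color attributes directly (this only sees formatting set directly on each run, not anything inherited from styles.xml):
import zipfile
import xml.etree.ElementTree as ET
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
with zipfile.ZipFile('1.docx') as docx:
    root = ET.fromstring(docx.read('word/document.xml'))
for run in root.iter(W + 'r'):
    text = ''.join(t.text or '' for t in run.iter(W + 't'))
    font = size = color = None
    props = run.find(W + 'rPr')
    if props is not None:
        rfonts = props.find(W + 'rFonts')
        if rfonts is not None:
            font = rfonts.get(W + 'ascii')
        sz = props.find(W + 'sz')
        if sz is not None:
            size = int(sz.get(W + 'val')) / 2  # w:sz is given in half-points
        col = props.find(W + 'color')
        if col is not None:
            color = col.get(W + 'val')  # hex string such as 'FF0000', or 'auto'
    print(repr(text), font, size, color)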
Any help will be highly appreciated.
I used PyPDF2 to extract text from a PDF file.
This is the code I wrote.
import PyPDF2 as pdf
file = open("file_to_scrape.pdf",'rb')
pdf_reader = pdf.PdfFileReader(file)
page_1 = pdf_reader.getPage(0)
print(page_1.extractText())
This gave the following output.
˜˚Power: An Enabler for Industrialization
and Regional Cooperation
˜˚.˜ Introduction
The weird characters in front of "Power" and "Introduction" are supposed to be numbers, 15 and 15.1 to be precise.
I copied them and tried to encode them to utf-8, but this is what I got.
b'\xcb\x9c\xcb\x9aPower: An Enabler for Industrialization and Regional Cooperation\xcb\x9c\xcb\x9a.\xcb\x9c Introduction'
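Decoding those bytes (just a quick check, not a fix) shows the leading characters come through as U+02DC and U+02DA, i.e. the digit glyphs are being mapped to the wrong Unicode code points:
s = b'\xcb\x9c\xcb\x9a'.decode('utf-8')
print([hex(ord(c)) for c in s])  # ['0x2dc', '0x2da'] -> '˜' and '˚', not '1' and '5'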
Could someone please help in figuring out this issue? My aim is to extract a list of all figures and headings in the PDF along with their numbering.
I have a PDF file that I am reading with PyMuPDF, using the syntax below.
import fitz  # this is pymupdf
with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        text += page.getText()
Is there a way to ignore the header and footer while reading it?
I tried converting pdf to docx as it is easier to remove headers, but the pdf file I am working on is getting reformatted when I convert it to docx.
Is there any way PyMuPDF can do this during the read?
The documentation has a page dedicated to this problem.
Define a rectangle that omits the header.
Use the page.get_textbox(rect) method.
Source: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction#2-pageget_textboxrect
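A minimal sketch of that approach (the 50-point header and footer heights are assumptions; measure them for your file):
import fitz  # PyMuPDF
HEADER_HEIGHT = 50  # points to skip at the top (assumed; adjust to your layout)
FOOTER_HEIGHT = 50  # points to skip at the bottom (assumed)
with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        rect = page.rect  # the full page rectangle
        # clip rectangle that excludes the header and footer bands
        clip = fitz.Rect(rect.x0, rect.y0 + HEADER_HEIGHT,
                         rect.x1, rect.y1 - FOOTER_HEIGHT)
        text += page.get_textbox(clip)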
The generic solution that works for most PDF libraries is to:
check the size of the header/footer section in your PDF files
loop over each piece of text in the document and check its vertical position (a sketch of this follows below)
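For example, PyMuPDF can return each text block together with its bounding box, so you can keep only the blocks that fall between the header and footer bands (the band heights are assumptions you would tune per document):
import fitz  # PyMuPDF
HEADER_HEIGHT = 50  # assumed header band height in points
FOOTER_HEIGHT = 50  # assumed footer band height in points
with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        page_height = page.rect.height
        # each block is (x0, y0, x1, y1, text, block_no, block_type)
        for x0, y0, x1, y1, block_text, block_no, block_type in page.get_text("blocks"):
            if y0 >= HEADER_HEIGHT and y1 <= page_height - FOOTER_HEIGHT:
                text += block_text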
I want to classify and analyze the chapters and subchapters of a book in PDF format: count the number of words and examine which word occurs how often in which chapter.
pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader
# Creating a pdf file object
pdf = open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf', "rb")
# creating pdf reader object
pdf_reader = PyPDF2.PdfFileReader(pdf)
# checking number of pages in a pdf file
print(pdf_reader.numPages)
print(pdf_reader.getDocumentInfo())
# creating a page object
page = pdf_reader.getPage(0)
# finally extracting text from the page
print(page.extractText())
# Extracting entire PDF
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    a = str(1 + pdf_reader.getPageNumber(page))
    print(a)
    page_content = page.extractText()
    print(page_content)
# closing the pdf file
pdf.close()
This code already works. Now I want to do more analysis, for example:
store each chapter in its own variable and count the number of words.
In the end, everything should be stored in an Excel file.
I tried something similar with CVs in PDF format, but all I came to know is the following:
PDF is an unstructured format, so it is not possible to extract information from every PDF in a structured way. But if you know the structure of the books, you can identify the chapter titles by whatever makes them unique, e.g. that they are set in bold or italic.
You can then traverse through the text, collecting it until you hit the next chapter title.
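A rough sketch of that idea (not a drop-in solution: it assumes chapter titles use a bold font larger than about 14 pt, which you would need to verify against the actual book, and it uses PyMuPDF because that library exposes per-span font information):
import re
from collections import Counter
import fitz  # PyMuPDF
chapters = {"FRONT_MATTER": ""}  # chapter title -> accumulated text
current = "FRONT_MATTER"
with fitz.open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf') as doc:
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):
                for span in line["spans"]:
                    if "Bold" in span["font"] and span["size"] > 14:
                        # treat this span as the start of a new chapter
                        current = span["text"].strip()
                        chapters.setdefault(current, "")
                    else:
                        chapters[current] += " " + span["text"]
for title, body in chapters.items():
    words = re.findall(r"\w+", body.lower())
    print(title, len(words), Counter(words).most_common(5))
From the chapters dictionary it is then a small step to a pandas DataFrame and DataFrame.to_excel for the Excel output.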
As an absolute newbie to Python, I stumbled over a few difficulties using the newspaper library. My goal is to use newspaper on a regular basis to download all new articles from a German news website called "tagesschau" and all articles from CNN, to build a data stack I can analyze in a few years.
If I got it right, I could use the following commands to download and scrape all articles:
import newspaper
from newspaper import news_pool
tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')
papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2)  # (2*2) = 4 threads total
news_pool.join()
If that's the right way to download all articles, how can I extract and save them outside of Python? Or save the articles in Python so that I can reuse them after restarting Python?
Thanks for your help.
The following code will save the downloaded articles in HTML format. In the folder you'll find tagesschau_paper0.html, tagesschau_paper1.html, tagesschau_paper2.html, ...
import newspaper
from newspaper import news_pool
tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')
papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2)
news_pool.join()
# write each downloaded article's raw HTML to its own file
for i in range(tagesschau_paper.size()):
    with open("tagesschau_paper{}.html".format(i), "w") as file:
        file.write(tagesschau_paper.articles[i].html)
Note: news_pool doesn't get anything from CNN, so I skipped writing code for it. If you check cnn_paper.size(), it comes out as 0. You have to import and use Source instead.
The above code can be followed as an example to save articles in other formats too, e.g. txt, and also to save only the parts you need from the articles, e.g. authors, body, publish_date.
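For instance, a small sketch along those lines: parse each downloaded article first, then write only the fields you care about to a plain-text file (this assumes the articles were already downloaded via news_pool as above):
for i, article in enumerate(tagesschau_paper.articles):
    article.parse()  # fills in title, authors, publish_date and text from the HTML
    with open("tagesschau_paper{}.txt".format(i), "w") as file:
        file.write("{}\n{}\n{}\n\n{}".format(
            article.title,
            ", ".join(article.authors),
            article.publish_date,
            article.text,
        ))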
You can use pickle to save objects outside of Python and reopen them later:
import pickle
file_Name = "testfile"
# open the file for writing
fileObject = open(file_Name, 'wb')
# this writes the object news_pool to the
# file named 'testfile'
pickle.dump(news_pool, fileObject)
# here we close the fileObject
fileObject.close()
# we open the file for reading (binary mode is required by pickle)
fileObject = open(file_Name, 'rb')
# load the object from the file into var news_pool_reopen
news_pool_reopen = pickle.load(fileObject)