I'm having trouble extracting text from PDF files using Python.
My goal is to get the text from a PDF file (from chapter 1 to chapter 2, for example) and write it to a docx file (or a txt file). However, the text I get is full of incorrect spacing.
Text example: "
Chapter 1
Aerial seeding can quickly cover large and physically inaccessible areas1 to improve soil quality and scavenge residual nitrogen in agriculture2, and for postfire reforestation3,4,5 and wildland restoration6,7. However, it suffers from low germination rates, due to the direct exposure of unburied
Chapter 2
"
Text output on docx file: "
Chapter 1
Aerial seed ing can quic kly cover large and phys ically inacce ssible a reas1 to improve soil quality and scavenge residu al nitrogen in agriculture2, and for postf ire refore station3,4,5 and wil dland restorati on6,7. However, it suffers from low germina tion rates, due to the direct expos ure of unbur ied
Chapter 2
"
Notice that the words are incorrectly spaced.
My code is the following:
import PyPDF2
# Open the PDF file
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # Initialize an empty string to store the text
    text = ""
    # Loop through each page in the PDF file
    for page in range(pdf_reader.getNumPages()):
        # Extract the text from the page
        page_text = pdf_reader.getPage(page).extractText()
        # Append the page text to the overall text
        text += page_text
        # Stop extracting text after the end of chapter 1
        if "CAPÍTULO II" in page_text:
            break
# Extract the text from Chapter 1 to Chapter 2
start = text.find("Chapter 1")
end = text.find("Chapter 2")
text = text[start:end]
print(text)
# Save the extracted text to a new file
with open('extracted_text.txt', 'w') as text_file:
    text_file.write(text)
The expected output is the first text exactly as it is.
How can I solve this case?
If I drag a PDF onto a shortcut (one I prepared earlier), it automatically generates the text: no mess, no fuss, just a right-click or a drag and drop. HOWEVER, it needs pdftotext (from Xpdf or Poppler, either will do) as the automator, with the shortcut running:
C:\Windows\System32\cmd.exe /c C:\Apps\PDF\xpdf\xpdf-tools-win-4.04\bin32\pdftotext.exe -layout -nopgbrk -enc UTF-8
Once you have the layout text you can slice and dice it with Python, or MS NotePad with VBS or any other richer editor like WordPad
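The same pdftotext call can also be driven from Python rather than from a shortcut. This is only a minimal sketch: it assumes pdftotext is on the PATH (pass the full path to pdftotext.exe otherwise), the helper names build_pdftotext_command and pdf_to_layout_text are made up for illustration, and it uses only the flags from the command above.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, exe="pdftotext"):
    """Return the argument list for the pdftotext call shown above.

    `exe` is an assumption: it presumes pdftotext is on the PATH.
    """
    return [exe, "-layout", "-nopgbrk", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def pdf_to_layout_text(pdf_path, exe="pdftotext"):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, exe), check=True)
    return txt_path.read_text(encoding="utf-8")
```

Calling pdf_to_layout_text("example.pdf") would then give you one string to slice between the chapter headings.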
Following the tips to use PyMuPDF, I managed to get it working. I used the code below:
import fitz
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
# Open the PDF file
doc = fitz.open('Regulamento-GalapagosDeepOceanFICFIM06022023VF.pdf')
# Create a new Word document
document = Document()
# Set the font style for the document
font_style = document.styles['Normal']
font_style.font.size = Pt(12)
text = ""
text_aux = ""
# Iterate over each page in the PDF file
for page in doc:
    # Extract the text from the page
    text = page.get_text()
    # Add the text to a new paragraph in the Word document
    #paragraph = document.add_paragraph(text)
    # Set the paragraph alignment to left
    #paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
    text_aux = text_aux + text
# Extract the text between the two chapter headings
start = text_aux.find("CHAPTER 1")
end = text_aux.find("CHAPTER 2")
text1 = text_aux[start:end]
paragraph = document.add_paragraph(text1)
paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
print(text1)
# Save the Word document
document.save('example_test.docx')
Using this code I was able to get the text with the exact format and extract the chapters as wanted.
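One caveat with slicing by find(): str.find returns -1 when a marker is missing, and the slice then silently returns the wrong text. A minimal sketch of a more defensive version follows; extract_between is a hypothetical helper name and the sample string is made up.

```python
def extract_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Returns an empty string if start_marker is missing; if end_marker
    is missing, everything from start_marker onward is returned.
    """
    start = text.find(start_marker)
    if start == -1:
        return ""
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return text[start:]
    return text[start:end]


sample = "Preface\nCHAPTER 1\nFirst chapter body.\nCHAPTER 2\nSecond chapter body."
print(extract_between(sample, "CHAPTER 1", "CHAPTER 2"))
```

In the code above, text_aux would take the place of sample.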
I want to automatically turn pdf files into text, and then take that output to save a file on my desktop.
Example:
-- pdf converted text: "HELLO WORLD"
-- save file on desktop on a .txt file with "HELLO WORLD" saved.
I have done:
fp = open('/Users/zain/Desktop', 'pdf_summary')
fp.write(text)
I thought this would save my file on the desktop given the input (text) which I used as the variable to house the converted text.
Full Code:
from PyPDF2 import PdfReader
reader = PdfReader("/Users/zain/Desktop/Week2_POL305_Manfieldetal.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
print(text)
fp = open('/Users/zain/Desktop', 'pdf_summary')
fp.write(text)
fp.write(text)
A PDF may contain all sorts of things, not only text.
You therefore have to explicitly extract the text from a PDF, if that is what you want.
With the PyMuPDF package you could do it this way:
import fitz # import pymupdf
import pathlib
doc=fitz.open("input.pdf")
text = "\n".join([page.get_text() for page in doc])
pathlib.Path("input.txt").write_bytes(text.encode()) # supports non ASCII text
This works for me:
from PyPDF2 import PdfReader
# path to the PDF file
reader = PdfReader(r'C:\Users\zain\Desktop\Week2_POL305_Manfieldetal.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text() + '\n'
# path to save the file on the desktop
# you can keep .txt, drop the extension, or change it to another file type
with open(r'C:\Users\zain\Desktop\pdf_summary.txt', 'a') as fp:
    fp.write(text)
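For context on why the original call failed: the second argument of open() is the mode ('w', 'a', and so on), not a file name, so the file name has to be part of the path itself. A minimal sketch with pathlib follows; save_text is an illustrative helper, and a temporary directory stands in for the Desktop folder.

```python
import tempfile
from pathlib import Path


def save_text(text, folder, filename="pdf_summary.txt"):
    """Write text to folder/filename and return the full path."""
    target = Path(folder) / filename
    # The mode goes in the second argument; the file name belongs in the path.
    with open(target, "w", encoding="utf-8") as fp:
        fp.write(text)
    return target


# A temporary directory stands in for e.g. Path.home() / "Desktop".
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("HELLO WORLD", tmp)
    print(saved.read_text(encoding="utf-8"))
```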
I want to create a document with serial barcodes (10 per page), so I am using the code below to generate a barcode:
my_code = Code128(document)
my_code.save(document)
The result of this code is an svg picture...So I want to insert this svg picture into a table in docx file and I am using this piece of code for that:
doc = Document('assets/test.docx')
tables = doc.tables
p = tables[0].rows[1].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('BL22002222.svg', width=Inches(3), height=Inches(1))
doc.save('assets/test.docx')
but it throws this error:
docx.image.exceptions.UnrecognizedImageError
I want to generate printable pages as a PDF or a docx file. I don't care which; the most important thing is that they contain the barcodes.
I've been looking at some of the documentation, but all of the work I've seen around docx is primarily directed towards working with text already in a Word document. What I'd like to know is whether there is a simple way to take text from either an HTML or a text document and import all of it into a Word document wholesale. It doesn't seem to like the string; it's too long.
My understanding of the documentation is that you have to work with text on a paragraph-by-paragraph basis. The task that I'd like to do is relatively simple, but it's beyond my Python skills. I'd like to set up the margins on the Word document, and then import the text into the Word document so that it adheres to the margins that I previously specified.
Does anyone have any thoughts? None of the previous posts that I have found have been very helpful.
import textwrap
import requests
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Inches
class DocumentWrapper(textwrap.TextWrapper):
    def wrap(self, text):
        split_text = text.split('\n\n')
        lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
        return lines
page = requests.get("http://classics.mit.edu/Aristotle/prior.mb.txt")
soup = BeautifulSoup(page.text,"html.parser")
#we are going to pull in the text wrap extension that we have added.
#The typical width that we want tow
text_wrap_extension = DocumentWrapper(width=82,initial_indent="",fix_sentence_endings=True)
new_string = text_wrap_extension.fill(page.text)
final_document = "Prior_Analytics.txt"
with open(final_document, "w") as f:
    f.writelines(new_string)
document = Document(final_document)
### Specified margin specifications
sections = document.sections
for section in sections:
    section.top_margin = (Inches(1.00))
    section.bottom_margin = (Inches(1.00))
    section.right_margin = (Inches(1.00))
    section.left_margin = (Inches(1.00))
document.save(final_document)
The error that I get thrown is below:
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Prior_Analytics.txt'
This error simply means there is no .docx file at the location you specified. So you can modify your code to create the file if it doesn't exist.
final_document = "Prior_Analytics.txt"
with open(final_document, "w+") as f:
    f.writelines(new_string)
You are providing a relative path. How do you know what Python's current working directory is? That's where the relative path you give will start from.
A couple lines of code like this will tell you:
import os
print(os.path.realpath('./'))
Note that:
docx is used to open .docx files
I got it.
from docx import Document
from docx.shared import Inches
document = Document()
sections = document.sections
for section in sections:
    section.top_margin = Inches(2)
    section.bottom_margin = Inches(2)
    section.left_margin = Inches(2)
    section.right_margin = Inches(2)
# Add your text here. add_paragraph accepts text of whatever size.
document.add_paragraph("Add your text here.")
# The name of the document goes here, as a string ("output.docx" is a placeholder).
document.save("output.docx")
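Since Word wraps text to the margins by itself, hard line breaks from the source text are usually better removed than re-wrapped. As a rough sketch, assuming paragraphs in the source are separated by blank lines (split_paragraphs is a hypothetical helper and the sample text is made up), the text could be split into paragraphs before adding each one:

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and unwrap each paragraph to one line."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        joined = " ".join(block.split())  # collapse hard line breaks
        if joined:
            paragraphs.append(joined)
    return paragraphs


sample = "First paragraph,\nwrapped by hand.\n\nSecond paragraph."
for para in split_paragraphs(sample):
    # With python-docx this would be: document.add_paragraph(para)
    print(para)
```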
I have to convert a whole PDF to text. I have seen many examples of converting a PDF to text, but only for a particular page.
from PyPDF2 import PdfFileReader
import os
def text_extractor(path):
    with open(os.path.join(path,file), 'rb') as f:
        pdf = PdfFileReader(f)
        ###Here i can specify page but i need to convert whole pdf without specifying pages###
        page = pdf.getPage(0)
        text = page.extractText()
        print(text)
if __name__ == '__main__':
    path="C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
How to convert whole pdf file to text without using getpage()??
You may want to use textract as this answer recommends to get the full document if all you want is the text.
If you want to use PyPDF2 then you can first get the number of pages then iterate over each page such as:
from PyPDF2 import PdfFileReader
import os
def text_extractor(path):
    with open(os.path.join(path,file), 'rb') as f:
        pdf = PdfFileReader(f)
        # Iterate over every page instead of specifying a single one
        text = ""
        for page_num in range(pdf.getNumPages()):
            page = pdf.getPage(page_num)
            text += page.extractText()
        print(text)
if __name__ == '__main__':
    path="C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
Though you may want to remember which page the text came from in which case you could use a list:
page_text = []
for page_num in range(pdf.getNumPages()): # For each page
    page = pdf.getPage(page_num) # Get that page's reference
    page_text.append(page.extractText()) # Add that page to our array
for page in page_text:
    print(page) # print each page
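Once the text is kept per page like this, looking up where something occurs becomes straightforward. A small sketch, with made-up page texts standing in for the extractText() results and pages_containing as a hypothetical helper name:

```python
def pages_containing(page_text, needle):
    """Return the 1-based page numbers whose text contains needle."""
    return [
        number
        for number, text in enumerate(page_text, start=1)
        if needle in text
    ]


# Made-up page texts standing in for the extractText() results.
page_text = ["Title page", "Chapter 1 starts here", "More of Chapter 1"]
print(pages_containing(page_text, "Chapter 1"))  # [2, 3]
```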
You could use tika to accomplish this task, but the output needs a little cleaning.
from tika import parser
parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print (parse_entire_pdf)
This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.
from PyPDF2 import PdfFileReader
def pdf_text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Get total pdf page number.
        totalPageNumber = pdf.numPages
        currentPageNumber = 0
        while (currentPageNumber < totalPageNumber):
            page = pdf.getPage(currentPageNumber)
            text = page.extractText()
            # The encoding put each page on a single line.
            # type is <class 'bytes'>
            print(text.encode('utf-8'))
            #################################
            # This outputs the text to a list,
            # but it doesn't keep paragraphs
            # together
            #################################
            # output = text.encode('utf-8')
            # split = str(output, 'utf-8').split('\n')
            # print (split)
            #################################
            # Process next page.
            currentPageNumber += 1
path = 'mypdf.pdf'
pdf_text_extractor(path)
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
PDF is a page-oriented format & therefore you'll need to deal with the concept of pages.
What makes it perhaps even more difficult is that you're not guaranteed that the text excerpts you extract come out in the same order as they are presented on the page. PDF allows one to say "put this text within a 4x3 box situated 1 inch from the top, with a 1 inch left margin", and then put the next piece of text somewhere else on the same page.
Your extractText() function simply gets the extracted text blocks in document order, not presentation order.
Tables are notoriously difficult to extract in a common, meaningful way... You see them as tables, PDF sees them as text blocks placed on the page with little or no relationship.
Still, getPage() and extractText() are good starting points & if you have simply formatted pages, they may work fine.
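To illustrate the difference between document order and presentation order: if your extraction library can report each text block's position, the blocks can be re-sorted top-to-bottom and left-to-right. This is only a sketch under stated assumptions: the (x, y, text) tuples are made up, y is assumed to grow downward from the top of the page, and sort_reading_order with its line_tolerance parameter is a hypothetical helper, not part of PyPDF2.

```python
def sort_reading_order(blocks, line_tolerance=3.0):
    """Sort (x, y, text) blocks top-to-bottom, then left-to-right.

    Assumes y grows downward from the top of the page; blocks whose y
    differs from the first block of a line by less than line_tolerance
    are treated as belonging to that line.
    """
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    lines = []
    for block in ordered:
        if lines and abs(block[1] - lines[-1][0][1]) < line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    # Within each line, order the blocks by their x position.
    return [b[2] for line in lines for b in sorted(line, key=lambda b: b[0])]


# Made-up blocks in "document order": the footer was written before "Hello".
blocks = [(300, 100.0, "World"), (72, 400.0, "Footer"), (72, 101.0, "Hello")]
print(sort_reading_order(blocks))  # ['Hello', 'World', 'Footer']
```

This simple approach breaks down for multi-column layouts and tables, which is part of why those are so hard to extract.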
I found out a very simple way to do this.
You have to follow these steps:
Install PyPDF2: if you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission to do this):
pip install PyPDF2
If you're not using Anaconda, you have to install pip and add its path to your cmd or terminal.
Python code: the following code shows how to convert a PDF file very easily:
import PyPDF2
with open("pdf file path here",'rb') as file_obj:
    pdf_reader = PyPDF2.PdfFileReader(file_obj)
    raw = pdf_reader.getPage(0).extractText()
print(raw)
I just used the pdftotext module to get this done easily.
import pdftotext
# Load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# create a text file by iterating through all pages in the PDF
with open("test.txt", "w") as file:
    for page in pdf:
        file.write(page)
Link: https://github.com/manojitballav/pdf-text
I have tons of commercial invoices to work with, in PDF format. Some information, such as the billing party, the transaction date and the amount of money, needs to be picked out.
In other words, I need to copy this information from each commercial invoice and paste it into an Excel spreadsheet.
The information is always at the same position on each PDF document.
Is there a way I can have Python pick up the information and store it in an Excel spreadsheet, instead of copying and pasting manually?
Thanks.
To read the PDF file you can use StringIO (Python 2):
from StringIO import StringIO
pdfContent = StringIO(getPDFContent("Billineg.pdf").encode("ascii", "ignore"))
for line in pdfContent:
    print line
Another option is to use pyPdf. A small example:
from pyPdf import PdfFileReader
input1 = PdfFileReader(file("Billineg.pdf", "rb"))
# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)
After extracting the data you can write it to a CSV file, or for Excel you can use xlwt.
getPDFContent is the following helper function:
import pyPdf
def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
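For the CSV route mentioned above, the standard csv module is enough, and Excel opens the result directly. The following is a minimal sketch in Python 3 (unlike the Python 2 snippets above); the column names, the sample values and the invoices_to_csv helper are all made up for illustration, and the step that pulls the fields out of each PDF is left out.

```python
import csv
import io


def invoices_to_csv(rows, out):
    """Write invoice dicts to a CSV stream that Excel can open."""
    fieldnames = ["billing_party", "date", "amount"]  # hypothetical columns
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up values standing in for fields pulled out of each PDF.
rows = [
    {"billing_party": "ACME Ltd", "date": "2014-03-01", "amount": "1200.50"},
    {"billing_party": "Globex", "date": "2014-03-02", "amount": "310.00"},
]
buffer = io.StringIO(newline="")
invoices_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, replace the buffer with open("invoices.csv", "w", newline="").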