PyPdf does not read the pdf text line by line - python

I was using PyPDF2 to read text from a PDF file. However, PyPDF2 does not read the text line by line. It reads it in a haphazard order and inserts line breaks where none exist in the PDF.
import PyPDF2
pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page_nos = pdfReader.numPages
for i in range(page_nos):
    # Creating a page object
    pageObj = pdfReader.getPage(i)
    # Printing Page Number
    print("Page No: ",i)
    # Extracting text from page
    # And splitting it into chunks of lines
    text = pageObj.extractText().split(" ")
    # Finally the lines are stored into list
    # For iterating over list a loop is used
    for i in range(len(text)):
        # Printing the line
        # Lines are separated using "\n"
        print(text[i],end="\n\n")
    print()
This gives me content as
Our Ref :
21
1
8
88
1
11
5
Name:
S
ky Blue
Ref 1 :
1
2
-
34
-
56789
-
2021/2
Ref 2:
F2021004
444
Amount:
$
1
00
.
11
...
Whereas expected was
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Here is the link to the pdf file
https://pdfhost.io/v/eCiktZR2d_sample2

I tried a different package, pdfplumber. It read the PDF line by line, exactly the way I wanted.
1. Install the pdfplumber package:
pip install pdfplumber
2. Extract the text and store it in a variable:
import pdfplumber
pdf_text = None
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    pdf_text = first_page.extract_text()
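The snippet above reads only the first page. A minimal sketch of the same idea extended to every page, assuming pdfplumber is installed; the helper names split_page_lines and read_pdf_lines are hypothetical and not part of pdfplumber:

```python
def split_page_lines(page_text):
    """Split one page's extracted text into stripped, non-empty lines."""
    # A page without a text layer may yield None or an empty string.
    if not page_text:
        return []
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def read_pdf_lines(pdf_path):
    """Return (page_number, line) pairs for every page of the PDF."""
    # Imported here so split_page_lines can be used without pdfplumber.
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for line in split_page_lines(page.extract_text()):
                results.append((page_number, line))
    return results
```

Calling read_pdf_lines(pdf_path) with the path from the question would give one entry per visual line, which can then be searched for labels such as "Our Ref" or "Amount:".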

Related

Extract a whole section from 1000 PDF files to Excel

How can I extract only Section 5 from 1000 PDF files into Excel?
Each PDF file has between 50 and more than 100 pages. Section 5 may be two lines long or run to several pages. It starts at the "Section 5" title and ends just before "Section 6". All 1000 PDF files are saved in one folder on the desktop.
All of the text can be selected and copied, so no OCR is needed.
The PDF files have the following format:
Section 1
xxxxx
Section 2
YYYY
Section 3
UUUUU
Section 4
OOOOO
Section 5
PPPPP
PPP
PPPP
Section 6
GGGG
The result table is expected to have the following format:
| File Name | Section 5 |
| -------- | -------- |
| File 1 | P... |
| File 2 | PP... |
| File 3 | PPPP... |
| File 4 | PP |
....
| File 1000 | PPP... |
There are at least a couple of ways to do this kind of thing. See the code below.
# Loop through all PDF files in a folder
# If your file names are like file1.pdf, file2.pdf, ... then you can use a for loop:
import PyPDF2
import re
for k in range(1,100):
    # open the pdf file
    reader = PyPDF2.PdfFileReader("C:/Users/Excel/Desktop/Coding/Python/PDF Files/Loop Through Multiple PDF Files and Search For Specific Text in Each/pdf_files/file%s.pdf"%(k))
    # get number of pages
    NumPages = reader.getNumPages()
    # define keyterms
    String = "New York State Real Property Law"
    # extract text and do the search
    for i in range(0, NumPages):
        PageObj = reader.getPage(i)
        print("this is page " + str(i))
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        print(ResSearch)
# Loop through all PDF files in a folder
# Otherwise you can walk through your folder using the os module
import PyPDF2
import re
import os
for foldername,subfolders,files in os.walk(r"C:/Users/Excel/Desktop/Coding/Python/PDF Files/Loop Through Multiple PDF Files and Search For Specific Text in Each/pdf_files/"):
    for file in files:
        # open the pdf file
        reader = PyPDF2.PdfFileReader(os.path.join(foldername,file))
        # get number of pages
        NumPages = reader.getNumPages()
        # define keyterms
        String = "New York State Real Property Law"
        # extract text and do the search
        for i in range(0, NumPages):
            PageObj = reader.getPage(i)
            print("this is page " + str(i))
            Text = PageObj.extractText()
            # print(Text)
            ResSearch = re.search(String, Text)
            print(ResSearch)
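The loops above search each page for a key term but stop short of pulling out a section. Here is a minimal sketch of that step. It assumes the text of each PDF has already been joined into one string (for example by concatenating the per-page extractText() results) and that the section titles sit on their own lines, as in the question. The names extract_section_5 and write_report are hypothetical. Writing a CSV file that Excel can open is also an assumption, since the question does not say which Excel format is wanted.

```python
import csv
import re

# Assumed layout: a line reading "Section 5", then the body, then a line
# starting with "Section 6" (or the end of the text if Section 6 is missing).
SECTION_5 = re.compile(
    r"^Section 5[ \t]*\n(.*?)(?=^Section 6\b|\Z)",
    re.DOTALL | re.MULTILINE,
)


def extract_section_5(full_text):
    """Return the body of Section 5, or an empty string if it is absent."""
    match = SECTION_5.search(full_text)
    return match.group(1).strip() if match else ""


def write_report(rows, csv_path):
    """Write (file name, section text) pairs to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["File Name", "Section 5"])
        writer.writerows(rows)


sample = (
    "Section 4\nOOOOO\n"
    "Section 5\nPPPPP\nPPP\nPPPP\n"
    "Section 6\nGGGG\n"
)
print(extract_section_5(sample))
```

In a full run, each file visited by os.walk would contribute one (file name, extract_section_5(text)) row, and write_report would be called once at the end.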

PDF range split

I am trying to split a PDF file by searching for a keyword. When the keyword is found, I want to take the page it is on plus the following 4 pages (5 pages in total) and write them to their own PDF. The keyword is repeated an unknown number of times further down the original PDF, so the loop should then continue, find the next occurrence, and again write that page plus the 4 after it to a new PDF.
Example: on the first pass the keyword is found on page 7, so pages 7-11 go into one PDF file.
On the next pass the keyword is found on page 12, so pages 12-16 go into a second PDF file. At this point two separate PDFs have been created.
The code below finds the keyword and writes the matching page to its own PDF file, but only that one page. I am not sure how to include the range.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
path = "example.pdf"
fname = os.path.basename(path)
reader = PdfFileReader(path)
for page_number in range(reader.getNumPages()):
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page_number))
    text = reader.getPage(page_number).extractText()
    text_stripped = text.replace("\n", "")
    print(text_stripped)
    if text_stripped.find("Disregarded Branch") != (-1):
        output_filename = f"{fname}_page_{page_number + 1}.pdf"
        with open(output_filename, "wb") as out:
            writer.write(out)
        print(f"Created: {output_filename}")
Disclaimer: I am the author of borb, the library used in this answer.
I think your question comes down to 2 common functionalities:
1. find the location of a given piece of text
2. merge/split/extract pages from a PDF
For the first part, there is a good tutorial in the examples repo.
You can find it here. I'll repeat one of the examples here for completeness.
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction
def main():
    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])
    # check whether we have read a Document
    assert doc is not None
    # print the text on the first Page
    print(l.get_text_for_page(0))
if __name__ == "__main__":
    main()
This example extracts all the text from page 0 of the PDF. Of course, you could simply iterate over all pages and check whether a given page contains the keyword you're looking for.
For the second part, you can find a good example in the examples repository. This is the link. This example (and the subsequent examples) takes you through the basics of frankensteining a PDF from various sources.
The example I copy/paste here shows how to build a PDF by alternately picking a page from input document 1 and input document 2.
import typing
from decimal import Decimal
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
def main():
    # open doc_001
    doc_001: typing.Optional[Document] = Document()
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc_001 = PDF.loads(pdf_file_handle)
    # open doc_002
    doc_002: typing.Optional[Document] = Document()
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc_002 = PDF.loads(pdf_file_handle)
    # create new document
    d: Document = Document()
    for i in range(0, 10):
        p: typing.Optional[Page] = None
        if i % 2 == 0:
            p = doc_001.get_page(i)
        else:
            p = doc_002.get_page(i)
        d.append_page(p)
    # write
    with open("output_003.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, d)
if __name__ == "__main__":
    main()
You've almost got it!
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
def create_5page_pdf(base_pdf_path, start):
    reader = PdfFileReader(base_pdf_path)
    writer = PdfFileWriter()
    # the keyword page plus the 4 pages after it
    for i in range(5):
        index = start + i
        if index < len(reader.pages):
            page = reader.pages[index]
            writer.addPage(page)
    fname = os.path.basename(base_pdf_path)
    output_filename = f"{fname}_page_{start + 1}.pdf"
    with open(output_filename, "wb") as out:
        writer.write(out)
    print(f"Created: {output_filename}")
def main(base_pdf_path="example.pdf"):
    reader = PdfFileReader(base_pdf_path)
    for page_number, page in enumerate(reader.pages):
        text = page.extractText()
        text_stripped = text.replace("\n", "")
        print(text_stripped)
        if text_stripped.find("Disregarded Branch") != (-1):
            create_5page_pdf(base_pdf_path, page_number)
if __name__ == "__main__":
    main()
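The page arithmetic can be checked without any PDF library. A minimal sketch, assuming the text of every page has already been extracted into a list of strings; keyword_page_ranges is a hypothetical helper name and the 16-page sample is made up to mirror the example in the question:

```python
def keyword_page_ranges(page_texts, keyword, pages_per_chunk=5):
    """Return one range of 0-based page indexes per page containing keyword.

    Each range starts at the matching page and covers pages_per_chunk pages,
    cut short if the document ends first.
    """
    ranges = []
    total = len(page_texts)
    for index, text in enumerate(page_texts):
        # Line breaks are removed first, as in the code above.
        if keyword in text.replace("\n", ""):
            ranges.append(range(index, min(index + pages_per_chunk, total)))
    return ranges


# Made-up document: keyword on pages 7 and 12 (1-based) of 16 pages.
pages = ["filler text"] * 16
pages[6] = "Form 1\nDisregarded Branch details"
pages[11] = "Form 2\nDisregarded Branch details"

for chunk in keyword_page_ranges(pages, "Disregarded Branch"):
    print(f"pages {chunk.start + 1}-{chunk.stop}")
```

Each range can then be fed to a writer loop such as the one above, adding reader.pages[index] for every index in the range.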

Control Break Logic Using Date

I am having a hard time figuring out how to read multiple lines of a text file and produce the required output. I tried the datetime format but it is not going anywhere. I would greatly appreciate any help.
What is being asked is:
Write Photo Report - For this part of the project we will take a data file and use Control Break logic on it to produce a report to the screen. The control break will be on the year the file was created. The data file will have the following format (one field per line):
Date Created (in the form DD-MM-YYYY)
Filename
Number of Bytes
For example, the following input file:
25-02-2019
MyTurtle.GIF
6000
11-05-2019
Smokey.GIF
4000
I am not able to read and output the dates in the file. What I currently have is:
def openFile(self):
    myFile = self.inputFile.getText()
    fileName = open(myFile, "r")
    text = fileName.readline()
    x = "%4s%25s%25s\n\n" % ("File Name", "Date Created", "Number of Bytes")
    date_str_format = '%Y-%m-%d'
    jobs = []
    for i in fileName:
        d = datetime.strptime(i[0],'%m-%d-%Y')
        if d in i:
            date = i
            x += "%4d\n" % date
You can use the built-in module re to extract all the file names, dates and byte counts from your file:
import re
with open('file.txt', 'r') as f:
    t = f.read()
dates = re.findall(r'\d\d-\d\d-\d\d\d\d', t) # Find all the dates in the form 00-00-0000
files = re.findall(r'\w+\.\w+', t) # Find all the file names in the form text.text
byte = re.findall(r'(?<!-)\d\d\d\d', t) # Find all the byte counts: four-digit numbers not preceded by a dash
print("File Name".ljust(15), "Date Created".ljust(15), "Number of Bytes".ljust(15))
for d, f, b in zip(dates, files, byte):
    print(f.ljust(15), d.ljust(15), b.ljust(15))
Output:
File Name       Date Created    Number of Bytes
MyTurtle.GIF    25-02-2019      6000
Smokey.GIF      11-05-2019      4000
For this, you need to read all the lines in the file and then process 3 lines at a time to generate one data row for the report.
Try this code:
from datetime import datetime
ss = '''
25-02-2019
MyTurtle.GIF
6000
11-05-2019
Smokey.GIF
4000
'''.strip()
with open('myfile.txt','w') as f: f.write(ss) # write data file
#############################
def openFile(self):
    x = "{:<20}{:<25}{:<25}".format("File Name", "Date Created", "Number of Bytes")
    print(x) # header row
    with open('myfile.txt', "r") as fileName:
        lines = fileName.readlines() # all lines in file
    for i in range(0,len(lines),3): # index every 3 lines
        dt = lines[i].strip() # date
        filename = lines[i+1].strip() # file name
        filesize = lines[i+2].strip() # file size
        d = datetime.strptime(dt,'%d-%m-%Y') # parse date
        x = "{:<20}{:<25}{:<25}".format(filename, str(d), str(filesize)) # data row
        print(x)
openFile(None)
Output
File Name           Date Created             Number of Bytes
MyTurtle.GIF        2019-02-25 00:00:00      6000
Smokey.GIF          2019-05-11 00:00:00      4000
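Neither report above breaks on the year, which is what the assignment asks for. A minimal sketch of the control break itself, assuming the records are already ordered by year and that a per-year byte total is the wanted summary (the assignment text shown here does not say what to print at each break). The names read_records and photo_report are hypothetical, and the 2020 record is made up so that a break actually occurs.

```python
from datetime import datetime


def read_records(lines):
    """Group non-empty lines in threes: date (DD-MM-YYYY), file name, bytes."""
    cleaned = [line.strip() for line in lines if line.strip()]
    records = []
    for i in range(0, len(cleaned) - 2, 3):
        created = datetime.strptime(cleaned[i], "%d-%m-%Y")
        records.append((created, cleaned[i + 1], int(cleaned[i + 2])))
    return records


def photo_report(records):
    """Build report lines, starting a new group whenever the year changes."""
    report = []
    current_year = None
    year_total = 0
    for created, name, size in records:
        if created.year != current_year:  # the control break
            if current_year is not None:
                report.append(f"Total for {current_year}: {year_total}")
            current_year = created.year
            year_total = 0
            report.append(f"Year {current_year}")
        report.append(f"{name:<20}{created:%d-%m-%Y}{size:>12}")
        year_total += size
    if current_year is not None:
        report.append(f"Total for {current_year}: {year_total}")
    return report


sample_lines = [
    "25-02-2019", "MyTurtle.GIF", "6000",
    "11-05-2019", "Smokey.GIF", "4000",
    "03-01-2020", "Example.GIF", "2500",  # made-up record for illustration
]
for row in photo_report(read_records(sample_lines)):
    print(row)
```

The key point is comparing each record's year with the previous one and flushing the running total before starting the next group, plus one final flush after the loop.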

How can I select a certain string in txt file and list it in csv file?

Here is the content of my text file. I only want to get the sha1 and the description of each scanned file and write them to a CSV file. Using a prefix and a delimiter, I trim each line and select the sha1 between "\" and "->". After that I want to get the description.
+----------------------------------------------------+
| VSCAN32 Ver 2.00-1655 |
| |
| Copyright (c) 1990 - 2012 xxx xxx xxx Inc. |
| |
| Maintained by xxxxxxxxx QA for VSAPI Testing |
+----------------------------------------------------+
Setting Process Priority to NORMAL: Success 1
Successfully setting POL Flag to 0
VSGetVirusPatternInformation is invoked
Reading virus pattern from lpt$vpn.527 (2018/09/25) (1452700)
Scanning samples_extracted\88330686ae94a9b97e1d4f5d4cbc010933f90f9a->(MS Office 2007 Word 4045-1)
->Found Virus [TROJ_FRS.VSN11I18]
Scanning samples_extracted\8d286d610f26f368e7a18d82a21dd68b68935d6d->(Microsoft RTF 6008-0)
->Found Virus [Possible_SMCCVE20170199]
Scanning samples_extracted\a10e5f964eea1036d8ec50810f1d87a794e2ae8c->(ASCII text 18-0)
->Found Virus [Trojan.VBS.NYMAIM.AA]
18 files have been checked.
Found 16 files containing viruses.
(malloc count, malloc total, free total) = (0, 35, 35)
Here is my code so far. It still outputs many unwanted strings, but I only need the sha1 and the description in the CSV. I used split so that the sha1 between "\" and "->" is selected. The sha1 is extracted, but the description is not trimmed and the rest of the file contents still end up in the output.
import csv
INPUTFILE = 'input.txt'
OUTPUTFILE = 'output.csv'
PREFIX = '\\'
DELIMITER = '->'
def read_text_file(inputfile):
    data = []
    with open(inputfile, 'r') as f:
        lines = f.readlines()
    for line in lines:
        line = line.rstrip('\n')
        if not line == '':
            line = line.split(PREFIX, 1)[-1]
            parts = line.split(DELIMITER)
            data.append(parts)
    return data
def write_csv_file(data, outputfile):
    # On Python 3 the csv module expects a text-mode file opened with newline=''
    with open(outputfile, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"',
                               quoting=csv.QUOTE_ALL)
        for row in data:
            csvwriter.writerow(row)
def main():
    data = read_text_file(INPUTFILE)
    write_csv_file(data, OUTPUTFILE)
if __name__ == '__main__':
    main()
Here is what I want in my CSV: the sha1 and the description. My output file, however, contains the whole text file, with only the sha1 split out into its own column.
EDIT: At first it was working, but the following text can't be placed in the CSV file because it spans multiple lines. Any answers, please?
Scanning samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
- Invoice_No_94497.doc->Found Virus [Trojan.4FEC5F36]->(MIME 6010-0)
- Found 1/3 Viruses in samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
With minimal changes, you can use this code:
for line in lines:
    line = line.rstrip('\n')
    if not line == '' and DELIMITER in line and not "Found" in line: # <---
        line = line.split(PREFIX, 1)[-1]
        parts = line.split(DELIMITER)
        data.append(parts)
But I would prefer to use a regex:
import re
for line in lines:
    line = line.rstrip('\n')
    if re.search(r'[a-zA-Z0-9]{40}->\(', line): # <----
        line = line.split(PREFIX, 1)[-1]
        parts = line.split(DELIMITER)
        data.append(parts)
The result will be:
cat output.csv
"88330686ae94a9b97e1d4f5d4cbc010933f90f9a","(MS Office 2007 Word 4045-1)"
"8d286d610f26f368e7a18d82a21dd68b68935d6d","(Microsoft RTF 6008-0)"
"a10e5f964eea1036d8ec50810f1d87a794e2ae8c","(ASCII text 18-0)"
import re
import pandas as pd
storedvalue=[]
# open for reading; "a+" would start at the end of the file and read nothing
with open("inputfile","r") as a:
    for text in a.readlines():
        # match the full 40-character hash, including hashes that start with a letter
        matched_words=re.search(r'[0-9a-f]{40}->\(.*?\)',text)
        if matched_words is not None:
            matched_words=matched_words.group().split("->")
            storedvalue.append(tuple(matched_words))
dataframe=pd.DataFrame(storedvalue,columns=["hashvalue","description"])
dataframe.to_csv("output.csv")
The result will be:
hashvalue                                 description
88330686ae94a9b97e1d4f5d4cbc010933f90f9a  (MS Office 2007 Word 4045-1)
8d286d610f26f368e7a18d82a21dd68b68935d6d  (Microsoft RTF 6008-0)
a10e5f964eea1036d8ec50810f1d87a794e2ae8c  (ASCII text 18-0)
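The multi-line entries from the EDIT in the question put the sha1 on the "Scanning" line and the description on a later line, so a per-line match misses them. A minimal sketch that remembers the most recent sha1 until a description shows up, assuming every description has the form "->(...)" as in the sample log; parse_scan_log is a hypothetical name:

```python
import re

# A 40-character hex sha1 directly after a backslash on a "Scanning" line.
SCAN_RE = re.compile(r"Scanning\s+\S*\\([0-9a-fA-F]{40})")
# A description in parentheses directly after "->".
DESC_RE = re.compile(r"->(\([^)]*\))")


def parse_scan_log(lines):
    """Return (sha1, description) pairs for single-line and multi-line entries."""
    rows = []
    pending_sha1 = None
    for line in lines:
        scan = SCAN_RE.search(line)
        if scan:
            pending_sha1 = scan.group(1)
        if pending_sha1 is None:
            continue
        desc = DESC_RE.search(line)
        if desc:
            rows.append((pending_sha1, desc.group(1)))
            pending_sha1 = None  # wait for the next "Scanning" line
    return rows


log_lines = [
    r"Scanning samples_extracted\88330686ae94a9b97e1d4f5d4cbc010933f90f9a->(MS Office 2007 Word 4045-1)",
    r"->Found Virus [TROJ_FRS.VSN11I18]",
    r"Scanning samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5",
    r"- Invoice_No_94497.doc->Found Virus [Trojan.4FEC5F36]->(MIME 6010-0)",
    r"- Found 1/3 Viruses in samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5",
]
for sha1, description in parse_scan_log(log_lines):
    print(sha1, description)
```

The resulting pairs can be written with csv.writer or a DataFrame exactly as in the answers above.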

python replace value string with number

I'm building a system to sort some .log files into .txt format so that I can later send the data to Excel. There are 70+ files, and I scan every file for a keyword, which gives me 1000+ strings that I want to save in a .txt file. I can get every string I want and see which .log file each entry was taken from, but now I want to replace the name of the source .log file with a corresponding number 1, 2, 3, 4, 5... (one number per file instead of its file name). Code:
import glob
def LogFile(filename, tester):
    message = []
    data = []
    print(filename)
    with open(filename) as filesearch: # open search file
        filesearch = filesearch.readlines() # read file
    i = 1
    d = {}
    for filename in filename:
        if not d.get(filename, False):
            d[filename] = i
            i += 1
    for line in filesearch:
        if tester in line: # extract ""
            start = '-> '
            end = ':\ '
            number = line[line.find(start)+3: line.find(end)] #[ord('-> '):ord(' :\ ')]
            data.append(number) # store all found words in array
            text = line[line.find(end)+3:]
            message.append(text)
    with open('Msg.txt', 'a') as handler: # create .txt file
        for i in range(len(data)):
            handler.write(f"{i}|{data[i]}|{message[i]}")
# open with 'w' to "reset" the file.
with open('Msg.txt', 'w') as file_handler:
    pass
# ---------------------------------------------------------------------------------
for filename in glob.glob(r'C:\Users\FKAISER\Desktop\AccessSPA\SPA\*.log'):
    LogFile(filename, 'Sending Request: Tester')
I have tried using this function
i = 1
d = {}
for filename in filename:
    if not d.get(filename, False):
        d[filename] = i
        i += 1
but then it looks like this
I want every entry from the same file to carry the same number, as in the picture: 1 for all 26 log entries from the first file, 2 for the 10 entries from the next file in that folder, and so on.
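One thing worth noting about the attempt above: filename is a single string, so "for filename in filename" iterates over its characters rather than over files, and the dictionary ends up numbering characters. A minimal sketch of one way to get a number per file instead, assuming the numbering should follow the order in which the files are visited; number_files and format_rows are hypothetical helper names:

```python
def number_files(filenames):
    """Map each distinct file name to 1, 2, 3, ... in first-seen order."""
    numbers = {}
    for name in filenames:
        if name not in numbers:
            numbers[name] = len(numbers) + 1
    return numbers


def format_rows(file_number, entries):
    """Prefix every (number, message) entry with the number of its file."""
    return [f"{file_number}|{number}|{message}" for number, message in entries]


# Hypothetical file names standing in for the glob.glob(...) result.
files = ["first.log", "second.log", "first.log"]
print(number_files(files))
print(format_rows(2, [("17", "some message\n")]))
```

In the script above, the same effect could be had by looping with enumerate(glob.glob(...), start=1) and passing the resulting number into LogFile, so that the write call uses the file's number instead of the loop index i.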
