How can I extract only Section 5 from 1000 PDF files into Excel?
Each PDF file has 50 to over 100 pages. Section 5 may be as short as two lines or as long as several pages. The unique identifier for Section 5 is that it starts at the title of Section 5 and ends just before the title of Section 6. All 1000 PDF files are saved in one folder on the desktop.
All text can be copied and pasted and is machine-readable; no OCR is needed.
The PDF files are in the following format:
Section 1
xxxxx
Section 2
YYYY
Section 3
UUUUU
Section 4
OOOOO
Section 5
PPPPP
PPP
PPPP
Section 6
GGGG
The result table is expected to be in the following format:
| File Name | Section 5 |
| -------- | -------- |
| File 1 | P... |
| File 2 | PP... |
| File 3 | PPPP... |
| File 4 | PP |
| ... | ... |
| File 1000 | PPP... |
There are at least a couple of ways to do this kind of thing. See the code below.
# Loop through all PDF files in a folder.
# If your file names are like file1.pdf, file2.pdf, ... then you may use a for loop:
import PyPDF2
import re

for k in range(1, 100):
    # open the pdf file ("reader" rather than "object", which shadows a builtin)
    reader = PyPDF2.PdfFileReader("C:/Users/Excel/Desktop/Coding/Python/PDF Files/Loop Through Multiple PDF Files and Search For Specific Text in Each/pdf_files/file%s.pdf" % k)
    # get number of pages
    num_pages = reader.getNumPages()
    # define the key term
    search_string = "New York State Real Property Law"
    # extract text and do the search
    for i in range(num_pages):
        page = reader.getPage(i)
        print("this is page " + str(i))
        text = page.extractText()
        result = re.search(search_string, text)
        print(result)
# Loop through all PDF files in a folder.
# Otherwise you can walk through your folder using the os module:
import PyPDF2
import re
import os

for foldername, subfolders, files in os.walk(r"C:/Users/Excel/Desktop/Coding/Python/PDF Files/Loop Through Multiple PDF Files and Search For Specific Text in Each/pdf_files/"):
    for file in files:
        # open the pdf file
        reader = PyPDF2.PdfFileReader(os.path.join(foldername, file))
        # get number of pages
        num_pages = reader.getNumPages()
        # define the key term
        search_string = "New York State Real Property Law"
        # extract text and do the search
        for i in range(num_pages):
            page = reader.getPage(i)
            print("this is page " + str(i))
            text = page.extractText()
            result = re.search(search_string, text)
            print(result)
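Note that both loops above only search for a fixed string; to pull out Section 5 itself (one row per file, as in the expected table), the text between the "Section 5" and "Section 6" headings can be sliced with a regex once extractText() has produced the page text. A minimal sketch of that slicing step, using the sample layout from the question (the `sample` string and the `extract_section` helper are illustrative, not part of the original code):

```python
import re

def extract_section(text, n):
    # Capture everything between the "Section n" heading and the next
    # "Section n+1" heading (or the end of the document for the last section).
    pattern = rf"Section {n}\s*(.*?)(?=Section {n + 1}|\Z)"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

sample = "Section 4\nOOOOO\nSection 5\nPPPPP\nPPP\nPPPP\nSection 6\nGGGG"
print(extract_section(sample, 5))  # PPPPP / PPP / PPPP on separate lines
```

Joining all pages of one PDF into a single string before calling `extract_section`, and then writing (file name, section text) rows with `csv.writer` or pandas, would produce the File Name / Section 5 table sketched above.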
I was using PyPDF2 to read text from a PDF file. However, PyPDF2 does not read the text in the PDF line by line; it reads it in a haphazard manner, putting a new line somewhere even when one is not present in the PDF.
import PyPDF2

pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page_nos = pdfReader.numPages

for i in range(page_nos):
    # Creating a page object
    pageObj = pdfReader.getPage(i)
    # Printing the page number
    print("Page No: ", i)
    # Extracting text from the page and splitting it into chunks
    text = pageObj.extractText().split(" ")
    # The chunks are stored in a list; iterate over it with a loop
    for chunk in text:
        # Printing each chunk, separated by "\n"
        print(chunk, end="\n\n")
    print()
This gives me content as
Our Ref :
21
1
8
88
1
11
5
Name:
S
ky Blue
Ref 1 :
1
2
-
34
-
56789
-
2021/2
Ref 2:
F2021004
444
Amount:
$
1
00
.
11
...
Whereas expected was
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Here is the link to the pdf file
https://pdfhost.io/v/eCiktZR2d_sample2
I tried a different package called pdfplumber. It was able to read the PDF line by line, in exactly the way I wanted.
1. Install the package pdfplumber
pip install pdfplumber
2. Get the text and store it in some container
import pdfplumber

pdf_text = None
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    pdf_text = first_page.extract_text()
I'd like to use PyMuPDF to split a PDF: for each split file, a file named after the bookmark, containing only that bookmark's page.
I've successfully split my files, for example four PDF files from a four-page source PDF, but several of the resulting PDFs have not one page but a random number of pages.
import sys, fitz

file = '/home/ilyes/Bulletins_Originaux.pdf'
bookmark = ''

try:
    doc = fitz.open(file)
    toc = doc.getToC(simple=True)
except Exception as e:
    print(e)

for i in range(len(toc)):
    documentPdfCible = toc[i][1]
    documentPdfCibleSansSlash = documentPdfCible.replace("/", "-")
    numeroPage = toc[i][2]
    pagedebut = numeroPage
    pagefin = numeroPage + 1
    print(pagedebut)
    print(pagefin)
    doc2 = fitz.open(file)
    doc2.insertPDF(doc, from_page=pagedebut, to_page=pagefin, start_at=0)
    doc2.save('/home/ilyes/' + documentPdfCibleSansSlash + ".pdf")
    doc2.close()
Could you tell me what's wrong?
Maybe it's because I always reuse "doc2" in the loop?
Thank you,
Abou Ilyès
It seems weird that you open the same document twice.
You open your PDF file at doc = fitz.open(file) and again at doc2 = fitz.open(file).
Then you insert pages into the same file with doc2.insertPDF(doc, from_page = pagedebut, to_page = pagefin, start_at = 0).
Of course the doc file's TOC will get messed up completely by "randomly" inserting pages.
I recommend replacing doc2 = fitz.open(file) with doc2 = fitz.open().
This creates an empty "in-memory" PDF (see the documentation) into which you can then insert the pages you need from doc. Then save it as a new PDF named after its bookmark title by running
doc2.save('/home/ilyes/' + documentPdfCibleSansSlash + ".pdf")
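Since a bookmark may span more than one page, the page range for each bookmark can be derived from the next bookmark's start page before calling insertPDF. A pure-Python sketch of that range computation (`page_ranges` is an illustrative helper, assuming TOC entries are `(level, title, start_page)` tuples with 1-based page numbers, as getToC(simple=True) returns):

```python
def page_ranges(toc, page_count):
    # For each bookmark, the section runs from its own start page up to
    # the page before the next bookmark; the last bookmark runs to the end.
    ranges = []
    for i, (level, title, start) in enumerate(toc):
        end = toc[i + 1][2] - 1 if i + 1 < len(toc) else page_count
        ranges.append((title, start - 1, end - 1))  # 0-based from_page/to_page
    return ranges

toc = [(1, "Alice", 1), (1, "Bob", 3), (1, "Carol", 4)]
print(page_ranges(toc, 6))  # [('Alice', 0, 1), ('Bob', 2, 2), ('Carol', 3, 5)]
```

Each (title, from_page, to_page) tuple can then feed an empty in-memory document, as recommended above, via doc2.insertPDF(doc, from_page=..., to_page=...).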
I'm stuck. I want to take a Windows directory that the user specifies and list every file in that directory in a table with path, file name, file size, last modified time, and MD5 hash. For the life of me I can't figure out how to break it up into individual files; it only processes the path as a whole. I understand the path variable needs to be turned into the various files within the directory, but I don't know how to do that.
How can I create the table accordingly and add the MD5 hash column? The last modified time should be in a human-readable format, not a UNIX timestamp.
# import libraries
import os
import time
import datetime
import logging
import hashlib
from prettytable import PrettyTable
import glob

# user input
path = input("Please enter directory: ")
verbose = input("Please enter yes/no for verbose: ")
print("===============================================")

# processing input
if os.path.exists(path):
    print("Processing directory: ", path)
else:
    print("Invalid directory.")
    exit()

if verbose == "yes":
    print("Verbose selected")
elif verbose == "no":
    print("Verbose not selected")
else:
    print("Invalid input")
print("===============================================")

# process directory
directory = glob.glob(path)
filename = os.path.basename(path)
size = os.path.getsize(path)
modified = os.path.getmtime(path)

# output into a table
report = PrettyTable()
column_names = ['Path', 'File Name', 'File Size', 'Last Modified Time', 'MD5 Hash']
report.add_column(column_names[0], [directory])
report.add_column(column_names[1], [filename])
report.add_column(column_names[2], [size])
report.add_column(column_names[3], [modified])
report.sortby = 'File Size'
print(report)
Does this solution match your requirements? Using the builtin pathlib:
from pathlib import Path
from datetime import datetime
import hashlib

# ...your code getting path here...
directory = Path(path)
paths = []
filename = []
size = []
hashes = []
modified = []

files = list(directory.glob('**/*.*'))
for file in files:
    paths.append(file.parents[0])
    filename.append(file.parts[-1])
    size.append(file.stat().st_size)
    modified.append(datetime.fromtimestamp(file.stat().st_mtime))
    # open in binary mode: MD5 operates on bytes, and this also works for non-text files
    with open(file, 'rb') as f:
        hashes.append(hashlib.md5(f.read()).hexdigest())

# output into a table
report = PrettyTable()
column_names = ['Path', 'File Name', 'File Size', 'Last Modified Time', 'MD5 Hash']
report.add_column(column_names[0], paths)
report.add_column(column_names[1], filename)
report.add_column(column_names[2], size)
report.add_column(column_names[3], modified)
report.add_column(column_names[4], hashes)
report.sortby = 'File Size'
print(report)
Output:
+-------------------+------------------+-----------+----------------------------+----------------------------------+
| Path | File Name | File Size | Last Modified Time | MD5 Hash |
+-------------------+------------------+-----------+----------------------------+----------------------------------+
| C:\...\New folder | 1 - Copy (2).txt | 0 | 2019-12-05 15:35:31.562420 | d41d8cd98f00b204e9800998ecf8427e |
| C:\...\New folder | 1 - Copy (3).txt | 0 | 2019-12-05 15:35:31.562420 | d41d8cd98f00b204e9800998ecf8427e |
| C:\...\New folder | 1 - Copy.txt | 0 | 2019-12-05 15:35:31.562420 | d41d8cd98f00b204e9800998ecf8427e |
| C:\...\New folder | 1.txt | 0 | 2019-12-05 15:35:31.562420 | d41d8cd98f00b204e9800998ecf8427e |
+-------------------+------------------+-----------+----------------------------+----------------------------------+
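One caveat: f.read() loads each whole file into memory before hashing, which is fine for small files but wasteful for large ones. A stdlib-only sketch of hashing in fixed-size chunks instead (the temporary file and the `md5_of_file` helper are illustrative; the empty-file digest matches the one in the table above):

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=65536):
    # MD5 is defined over bytes, so open in binary mode and feed the
    # hash object chunk by chunk instead of one big read().
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# demo: an empty file hashes to the well-known empty-input MD5
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp_path = tmp.name
digest = md5_of_file(tmp_path)
os.unlink(tmp_path)
print(digest)  # d41d8cd98f00b204e9800998ecf8427e
```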
Here is the content of my text file. I only want to get the SHA-1 and the description, then parse them into a CSV file using a prefix and a delimiter. I trimmed the strings and selected the SHA-1 between "\" and "->"; then I want to get the description.
+----------------------------------------------------+
| VSCAN32 Ver 2.00-1655 |
| |
| Copyright (c) 1990 - 2012 xxx xxx xxx Inc. |
| |
| Maintained by xxxxxxxxx QA for VSAPI Testing |
+----------------------------------------------------+
Setting Process Priority to NORMAL: Success 1
Successfully setting POL Flag to 0
VSGetVirusPatternInformation is invoked
Reading virus pattern from lpt$vpn.527 (2018/09/25) (1452700)
Scanning samples_extracted\88330686ae94a9b97e1d4f5d4cbc010933f90f9a->(MS Office 2007 Word 4045-1)
->Found Virus [TROJ_FRS.VSN11I18]
Scanning samples_extracted\8d286d610f26f368e7a18d82a21dd68b68935d6d->(Microsoft RTF 6008-0)
->Found Virus [Possible_SMCCVE20170199]
Scanning samples_extracted\a10e5f964eea1036d8ec50810f1d87a794e2ae8c->(ASCII text 18-0)
->Found Virus [Trojan.VBS.NYMAIM.AA]
18 files have been checked.
Found 16 files containing viruses.
(malloc count, malloc total, free total) = (0, 35, 35)
So far this is my code. It still outputs many strings, but I only need the SHA-1 and the description to be parsed into the CSV. I used split so the SHA-1 can be selected between "\" and "->". It does put the SHA-1 in, but the description is not trimmed and the other contents are still there.
import csv

INPUTFILE = 'input.txt'
OUTPUTFILE = 'output.csv'
PREFIX = '\\'
DELIMITER = '->'


def read_text_file(inputfile):
    data = []
    with open(inputfile, 'r') as f:
        lines = f.readlines()
        for line in lines:
            line = line.rstrip('\n')
            if not line == '':
                line = line.split(PREFIX, 1)[-1]
                parts = line.split(DELIMITER)
                data.append(parts)
    return data


def write_csv_file(data, outputfile):
    # 'w' with newline='' (not 'wb'): the csv module expects text mode in Python 3
    with open(outputfile, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"',
                               quoting=csv.QUOTE_ALL)
        for row in data:
            csvwriter.writerow(row)


def main():
    data = read_text_file(INPUTFILE)
    write_csv_file(data, OUTPUTFILE)


if __name__ == '__main__':
    main()
Here is what I want in my CSV: the SHA-1 and the description. But my output file displays the whole text file; it only filtered out the SHA-1 and put it in a column.
EDIT: At first it was working, but this block of text can't be placed in the CSV file because of its multiple lines. Any answer, please?
Scanning samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
- Invoice_No_94497.doc->Found Virus [Trojan.4FEC5F36]->(MIME 6010-0)
- Found 1/3 Viruses in samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
With minimal changes, you can use this part of the code:
for line in lines:
    line = line.rstrip('\n')
    if not line == '' and DELIMITER in line and not "Found" in line:  # <---
        line = line.split(PREFIX, 1)[-1]
        parts = line.split(DELIMITER)
But I would prefer to use regex:
import re

for line in lines:
    line = line.rstrip('\n')
    if re.search(r'[a-zA-Z0-9]{40}->\(', line):  # <----
        line = line.split(PREFIX, 1)[-1]
        parts = line.split(DELIMITER)
        data.append(parts)
The result will be:
cat output.csv
"88330686ae94a9b97e1d4f5d4cbc010933f90f9a","(MS Office 2007 Word 4045-1)"
"8d286d610f26f368e7a18d82a21dd68b68935d6d","(Microsoft RTF 6008-0)"
"a10e5f964eea1036d8ec50810f1d87a794e2ae8c","(ASCII text 18-0)"
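The search-then-split can also be collapsed into a single capturing regex that grabs both fields at once (a sketch; `LINE_RE` is an illustrative name, and it assumes the SHA-1 is lowercase hex, as in the sample log):

```python
import re

# one group for the 40-character SHA-1 after the backslash,
# one for the parenthesised description after "->"
LINE_RE = re.compile(r"\\([0-9a-f]{40})->(\(.*?\))")

line = r"Scanning samples_extracted\88330686ae94a9b97e1d4f5d4cbc010933f90f9a->(MS Office 2007 Word 4045-1)"
match = LINE_RE.search(line)
print(match.groups())  # ('88330686ae94a9b97e1d4f5d4cbc010933f90f9a', '(MS Office 2007 Word 4045-1)')
```

Lines without a SHA-1/description pair (banners, summary counts) simply produce no match, so no "Found" filtering is needed.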
import re
import pandas as pd

# open for reading ("a+" positions the cursor at end-of-file, so readlines() would return nothing)
a = open("inputfile", "r")
storedvalue = []
for text in a.readlines():
    # match the full 40-character SHA-1 followed by the parenthesised description
    matched_words = re.search(r'[a-fA-F0-9]{40}->\(.*?\)', text)
    if matched_words is not None:
        matched_words = matched_words.group()
        matched_words = matched_words.split("->")
        storedvalue.append(tuple(matched_words))

dataframe = pd.DataFrame(storedvalue, columns=["hashvalue", "description"])
dataframe.to_csv("output.csv")
The result will be:
hashvalue description
88330686ae94a9b97e1d4f5d4cbc010933f90f9a (MS Office 2007 Word 4045-1)
8d286d610f26f368e7a18d82a21dd68b68935d6d (Microsoft RTF 6008-0)
a10e5f964eea1036d8ec50810f1d87a794e2ae8c (ASCII text 18-0)
I'm building a system to sort some .log files in .txt format so that I can later send them to Excel. There are 70+ files, and in every file I scan for a keyword; I get 1000+ strings that I want to save in a .txt. I can get every string I want and see which .log file each log has been taken from, but now I want to rename the file the .log came from with a corresponding number 1, 2, 3, 4, 5... (one number for every file instead of its file name). Code:
import glob


def LogFile(filename, tester):
    message = []
    data = []
    print(filename)
    with open(filename) as filesearch:  # open search file
        filesearch = filesearch.readlines()  # read file
    i = 1
    d = {}
    for filename in filename:
        if not d.get(filename, False):
            d[filename] = i
            i += 1
    for line in filesearch:
        if tester in line:  # extract ""
            start = '-> '
            end = ':\ '
            number = line[line.find(start)+3: line.find(end)]
            data.append(number)  # store all found words in an array
            text = line[line.find(end)+3:]
            message.append(text)
    with open('Msg.txt', 'a') as handler:  # append to the .txt file
        for i in range(len(data)):
            handler.write(f"{i}|{data[i]}|{message[i]}")


# open with 'w' to "reset" the file.
with open('Msg.txt', 'w') as file_handler:
    pass

# ---------------------------------------------------------------------------------
for filename in glob.glob(r'C:\Users\FKAISER\Desktop\AccessSPA\SPA\*.log'):
    LogFile(filename, 'Sending Request: Tester')
I have tried using this function
i = 1
d = {}
for filename in filename:
    if not d.get(filename, False):
        d[filename] = i
        i += 1
but then it looks like this
I want each file to have the same number as in the picture: 1 indicates all 26 logs from the first file, 2 indicates the 10 from the second file in that folder... etc.
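The attempt above has two problems: `for filename in filename:` iterates over the characters of a single filename string, and `i = 1` resets on every call, so every file would end up numbered 1. Keeping the counter dictionary outside the per-file function gives each distinct file a stable number. A minimal sketch (the `.log` names and the `file_id` helper are placeholders, not part of the original code):

```python
file_ids = {}

def file_id(filename):
    # The first time a file is seen it gets the next number;
    # later calls with the same name return the same number.
    return file_ids.setdefault(filename, len(file_ids) + 1)

print(file_id("first.log"))   # 1
print(file_id("second.log"))  # 2
print(file_id("first.log"))   # 1 again: same file, same number
```

In the code above, `file_id(filename)` could replace the filename when writing each row to Msg.txt, so every log line carries its source file's number instead of its name.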