I have multiple PDF files created with Access DB forms. The only way I can extract text from them is using pdfplumber. Here is my code and it works perfectly for just 1 file.
import pdfplumber

with pdfplumber.open('CS_page_1.pdf') as pdf:
    page = pdf.pages[0]
    string = page.extract_text()
    file_name = string[43:48]
    print(file_name)
I need to use this extracted string to rename this file and the 100 other files in the folder.
What would be the best way to do it?
I would first build a list of all the PDFs in your folder using glob (https://docs.python.org/3/library/glob.html).
Then iterate through each of them, run each one through pdfplumber to obtain the desired string (which you want to rename the file to), and rename each file individually (https://www.tutorialspoint.com/python/os_rename.htm). Something like this:
import glob
import os

import pdfplumber

arr_of_files = glob.glob("/path/to/pdfs/*.pdf")

for file in arr_of_files:
    with pdfplumber.open(file) as pdf:
        page = pdf.pages[0]
        string = page.extract_text()
        file_name = string[43:48]
    # Keep the original directory and the .pdf extension when renaming
    os.rename(file, os.path.join(os.path.dirname(file), file_name + ".pdf"))
import glob

import pdfplumber
from tqdm.auto import tqdm  # optional: shows a progress bar over the files

for current_pdf_file in tqdm(glob.glob("<pathname>/*.pdf")):
    with pdfplumber.open(current_pdf_file) as my_pdf:
        # do other things here, e.g. extract the text and rename the file as above
        pass
The following Python code extracts images from a PDF file and saves them as .jp2 files. The files end up named im1.jp2 and im2.jp2 and are overwritten on the next run by the images from the next PDF file in the path.
How can I give the .jp2 files a specific name in the write() call, e.g. pathname_im1.jp2? Or is it possible to rename them directly?
from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')

for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            with open(image.name, "wb") as fp:
                fp.write(image.data)
Well, there are actually a lot of good ways to do this. As was pointed out in the comments, you could just use enumerate:
from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')

for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for index, image in enumerate(page.images):
            filename = f'{index}{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)
Or you could add the datetime to the name (which I think is better and more reliable across different runs, if you don't have anything better to use).
from datetime import datetime
from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')

# Note this will use the same datetime for all images
date = datetime.now().strftime("%Y%m%d_%H%M%S")

for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            filename = f'{date}_{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)
You can then adapt this to whatever naming scheme you actually want.
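If you specifically want the source PDF's name in the image filename, as in the pathname_im1.jp2 example from the question, a minimal sketch reusing the same loop (with path.stem from pathlib) could look like this:

from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')

for path in pdfdirpath.glob('*.pdf'):
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            # e.g. images from dell.pdf become dell_im1.jp2, dell_im2.jp2, ...
            filename = f'{path.stem}_{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)

That way images from different PDFs no longer collide, as long as each PDF's image names are unique within that file.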
Consider two folders on your desktop, 'Mandar' and 'html'. Paste any PDF file into the 'html' folder and name it 'dell', and create a demo.py file in the 'Mandar' folder. Then create a few text files (2-4) as you wish, so that the 'html' folder contains some .txt files and only one PDF file.
import os
import PyPDF2 # install via 'pip install PyPDF2'
# Put location of your pdf file i.e. dell.pdf in 'location' variable
location = "C:/Users/Desktop/html/"
n = "dell.pdf"
path = os.path.join(location, n)
reader = PyPDF2.PdfReader(path)
pages = len(reader.pages)
print(f"The no. of pages in {n} is {pages}.")
Now run the program and you will see:
The no. of pages in dell.pdf is NUM. (NUM is the number of pages in your PDF)
Now suppose the 'html' folder always contains only one PDF file, but with any name: maybe dell, maybe ecc, maybe anything else. I want the variable 'n' to pick up that single PDF file automatically, so that the program runs and displays the same result with the different PDF file name and page count.
Give glob in the standard library a shot. It'll get you a list of all the matching PDF files in that directory.
import os
import PyPDF2
...
import glob

location = 'C:/Users/Desktop/html/'
candidates = glob.glob(os.path.join(location, '*.pdf'))

if len(candidates) == 0:
    raise Exception('No PDFs found')

file = open(candidates[0], 'rb')
...
You're looking for globbing. You can do that with pathlib:
from pathlib import Path
root = Path(location)
pdf_files = root.glob("*.pdf")
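Since the folder is supposed to hold exactly one PDF, a minimal sketch along those lines (with the folder path copied from the question) might be:

from pathlib import Path
import PyPDF2

location = "C:/Users/Desktop/html/"
pdf_files = list(Path(location).glob("*.pdf"))

if len(pdf_files) != 1:
    raise RuntimeError(f"Expected exactly one PDF, found {len(pdf_files)}")

path = pdf_files[0]
reader = PyPDF2.PdfReader(path)
print(f"The no. of pages in {path.name} is {len(reader.pages)}.")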
I have a script, below, that downloads the files listed in the rows of a single CSV file. I have no problem with it, it works well, and all files are downloaded into my 'Python Project' root folder.
But I would like to add two things. First, process not just 1 but multiple (20 or more) CSV files, so I don't have to manually change the name in open('name1.csv') every time the script has done its job. Second, the downloads need to be placed in a folder with the same name as the CSV file they come from. Hopefully I'm clear enough :)
Then I could have:
name1.csv -> name1 folder -> download from name1 csv
name2.csv -> name2 folder -> download from name2 csv
name3.csv -> name3 folder -> download from name3 csv
...
Any help or suggestions will be more than appreciated :) Many thanks!
from collections import Counter
import urllib.request
import csv
import os

with open('name1.csv') as csvfile:  # need to add multiple .csv files here
    reader = csv.DictReader(csvfile)
    title_counts = Counter()
    for row in reader:
        name, ext = os.path.splitext(row['link'])
        title = row['title']
        title_counts[title] += 1
        # need to create a folder for each CSV file with the download inside
        title_filename = f"{title}_{title_counts[title]}{ext}".replace('/', '-')
        urllib.request.urlretrieve(row['link'], title_filename)
You need to add an outer loop which iterates over the files in a specific folder. You can use either os.listdir(), which returns a list of all entries, or glob.iglob() with a *.csv pattern to get only files with the .csv extension.
There are also some minor improvements you can make to your code. You're using Counter in a way that could be replaced with a defaultdict or even a plain dict. Also, urllib.request.urlretrieve() is part of the legacy interface and might get deprecated, so you can replace it with a combination of urllib.request.urlopen() and shutil.copyfileobj().
Finally, to create a folder you can use os.mkdir(), but first you need to check whether the folder already exists using os.path.isdir(); this prevents a FileExistsError.
Full code:
from os import mkdir
from os.path import join, splitext, isdir
from glob import iglob
from csv import DictReader
from collections import defaultdict
from urllib.request import urlopen
from shutil import copyfileobj

csv_folder = r"/some/path"
glob_pattern = "*.csv"

for file in iglob(join(csv_folder, glob_pattern)):
    with open(file) as csv_file:
        reader = DictReader(csv_file)

        save_folder, _ = splitext(file)
        if not isdir(save_folder):
            mkdir(save_folder)

        title_counter = defaultdict(int)
        for row in reader:
            url = row["link"]
            title = row["title"]
            title_counter[title] += 1

            _, ext = splitext(url)
            save_filename = join(save_folder, f"{title}_{title_counter[title]}{ext}")

            with urlopen(url) as req, open(save_filename, "wb") as save_file:
                copyfileobj(req, save_file)
For the first request: just loop over a list containing the names of your desired files.
The list can be retrieved using os.listdir(path), which returns a list of the files inside path (a folder containing the CSV files, in your case).
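A minimal sketch of that idea, assuming the CSV files live in a folder next to the script (the folder name here is just an example), could look like this:

import os

csv_folder = "csv_files"  # example folder name, adjust to your setup

# os.listdir() also returns other entries, so keep only the .csv files
csv_names = [name for name in os.listdir(csv_folder) if name.endswith(".csv")]

for name in csv_names:
    with open(os.path.join(csv_folder, name)) as csvfile:
        ...  # run the existing download logic on this file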
I am reading PDF files and trying to extract keywords from them through NLP techniques. Right now the program accepts one PDF at a time. I have a folder on my D drive named 'pdf_docs'. The folder contains many PDF documents. My goal is to read each PDF file, one by one, from the folder. How can I do that in Python? The code that works so far is below.
import PyPDF2

file = open('abc.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)

count = 0
while count < 3:
    pageObj = fileReader.getPage(count)
    count += 1
    text = pageObj.extractText()
First, read all the files that are available under that directory:
from os import listdir
from os.path import isfile, join

mypath = 'D:/pdf_docs'  # the folder that holds your PDF documents
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
And then run your code for each file in that list:
import PyPDF2
from os import listdir
from os.path import isfile, join

mypath = 'D:/pdf_docs'  # the folder that holds your PDF documents
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

for file in onlyfiles:
    # listdir() returns bare filenames, so join them with the folder path
    fileReader = PyPDF2.PdfFileReader(open(join(mypath, file), 'rb'))
    count = 0
    while count < 3:
        pageObj = fileReader.getPage(count)
        count += 1
        text = pageObj.extractText()
os.listdir() will get you everything that's in a directory, files and subdirectories alike. So be careful to have only PDF files in your path, or add a simple filter to the list.
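A one-line filter along those lines (reusing the mypath folder from the snippets above) might look like:

from os import listdir

mypath = 'D:/pdf_docs'  # same folder as above

# Keep only the entries whose names end in .pdf (case-insensitive)
pdf_files = [f for f in listdir(mypath) if f.lower().endswith('.pdf')]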
Edit 1
You can also use the glob module, as it does pattern matching.
>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']
Note that glob's patterns use Unix shell-style wildcards, but the module itself is part of the standard library and works on all systems, just like os.
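For example, the same kind of pattern works on Windows as well; the path below is made up purely for illustration:

import glob

# Hypothetical Windows folder, just to illustrate the pattern
pdf_files = glob.glob(r"C:\Users\you\Documents\pdf_docs\*.pdf")
print(pdf_files)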
You can use glob to get a list of all PDF files in your directory via pattern matching.
import glob

pdf_dir = "/foo/dir"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    do_your_stuff()
import PyPDF2
import re
import glob

# your full path of directory
mypath = "dir"

for file in glob.glob(mypath + "/*.pdf"):
    print(file)
    if file.endswith('.pdf'):
        fileReader = PyPDF2.PdfFileReader(open(file, "rb"))
        count = fileReader.numPages
        # walk the pages from last to first
        while count > 0:
            count -= 1
            pageObj = fileReader.getPage(count)
            text = pageObj.extractText()
            print(text)
            num = re.findall(r'[0-9]+', text)
            print(num)
    else:
        print("not in format")
Let's go through the code:
Python can't handle PDF files natively, so we need to install the PyPDF2 package and then import it.
The glob function is used to list the files inside the directory.
The for loop walks over the files in the folder.
The if condition checks whether each file is actually in PDF format.
PdfFileReader then reads each PDF file in the folder.
numPages gives the number of pages in the PDF document.
The while loop visits every page and prints all the text in the file.
I am trying to use PyPDF2 to grab the number of pages of every pdf in a directory. I can use .getNumPages() to find the number of pages in one pdf file but I need to walk through a directory and get the number of pages for every file. Any ideas?
Here is the code I have so far:
import pandas as pd
import os
from PyPDF2 import PdfFileReader

df = pd.DataFrame(columns=['fileName', 'fileLocation', 'pageNumber'])
pdf = PdfFileReader(open('path/to/file.pdf', 'rb'))

for root, dirs, files in os.walk(r'Directory path'):
    for file in files:
        if file.endswith(".pdf"):
            df2 = pd.DataFrame([[file, os.path.join(root, file), pdf.getNumPages()]], columns=['fileName', 'fileLocation', 'pageNumber'])
            df = df.append(df2, ignore_index=True)
This code will just add the number of pages from the first PDF file in the directory to the dataframe for every file. If I try to pass a directory path to PdfFileReader() I get a
PermissionError: [Errno 13] Permission denied.
Yeah, use
import glob

list_of_pdf_filenames = glob.glob('*.pdf')
to return the list of all PDF filenames in a directory.
Edit:
By placing the open() statement inside the loop, I was able to get this code to run on my computer:
import pandas as pd
import os
from PyPDF2 import PdfFileReader

df = pd.DataFrame(columns=['fileName', 'fileLocation', 'pageNumber'])

for root, dirs, files in os.walk(r'/home/benjamin/docs/'):
    for f in files:
        if f.endswith(".pdf"):
            pdf = PdfFileReader(open(os.path.join(root, f), 'rb'))
            df2 = pd.DataFrame([[f, os.path.join(root, f), pdf.getNumPages()]], columns=['fileName', 'fileLocation', 'pageNumber'])
            df = df.append(df2, ignore_index=True)

print(df.head())
Step 1:
pip install PyPDF2
Step 2:
import requests, PyPDF2, io

url = 'sample.pdf'  # should be the URL of a PDF reachable over HTTP
response = requests.get(url)

with io.BytesIO(response.content) as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()
    print(num_pages)