I need to extract text from a pdf using Python (NLP application), and want to leave out the first 5 lines from the text on every page. I tried looking online, but couldn't find anything substantial. I am using the below code to read all text on the pages. Is there a post-extraction step that can remove from all pages the first few lines, or maybe something that can be done at extraction stage itself?
import PyPDF2

fileReader = PyPDF2.PdfFileReader(file)
s = ""
for i in range(2, fileReader.numPages):
    s += fileReader.getPage(i).extractText()
Split the extracted text on "\n" and slice off the first 5 lines:
import pdfplumber

pdf = pdfplumber.open("CS.pdf")
for page in pdf.pages:
    text = page.extract_text()
    for line in text.split("\n")[5:]:
        print(line)
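If you want one accumulated string for NLP (like the question's `s`) instead of printed output, here is a minimal sketch that works on already-extracted page texts; the helper name and the sample data are my own:

```python
def drop_header_lines(page_texts, n=5):
    """Join page texts after dropping the first n lines of each page."""
    return "\n".join(
        "\n".join(text.split("\n")[n:]) for text in page_texts
    )

# Example with two short "pages" of six lines each:
pages = ["h1\nh2\nh3\nh4\nh5\nbody A", "h1\nh2\nh3\nh4\nh5\nbody B"]
drop_header_lines(pages)  # -> "body A\nbody B"
```

With pdfplumber you would build `page_texts` as `[page.extract_text() or "" for page in pdf.pages]` first.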
I am still a newbie to Python. I am trying to develop a generic PDF scraper that writes to a CSV organized into two columns: page number and paragraph.
I'm using the PyMuPDF library and I have managed to extract all the text, but I have no clue how to parse the text and write it into a CSV:
page number, paragraph
page number, paragraph
page number, paragraph
Luckily there is a structure. Each paragraph ends with an enter (\n). Each page ends with a page number followed by an enter (\n). I would like to include headers as well but they are harder to delimit.
import fitz
import csv

pdf = '/file/path.pdf'
doc = fitz.open(pdf)
for page in doc:
    text = page.getText("text")  # "text" must be a string argument, not a bare name
    print(text)
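Based on the structure described above, a hedged sketch of the parsing step; it assumes the page texts have already been extracted (e.g. via `page.getText("text")`), uses the page index rather than the trailing page-number line, and the function name is my own:

```python
import csv

def write_paragraph_csv(pages_text, out_file):
    """Write one (page number, paragraph) row per non-empty line."""
    writer = csv.writer(out_file)
    writer.writerow(["page number", "paragraph"])
    for page_no, text in enumerate(pages_text, start=1):
        # Each paragraph ends with "\n", so split on it and
        # drop empty fragments.
        for paragraph in text.split("\n"):
            paragraph = paragraph.strip()
            if paragraph:
                writer.writerow([page_no, paragraph])
```

With a real PDF you would build `pages_text` as a list of per-page strings and pass an open CSV file as `out_file`.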
I have created a program to parse through multiple PDF files and return a line of data from each page. I came across the issue that some of the pages within my PDF files do not have this line. When this happens my code just omits the page entirely; however, I would like it to print a single 'None' for the pages where it cannot find the specified line. I thought this was a simple fix, but it's proving to be a little more complicated than I thought. Here is an example of the line I am pulling and what I have tried:
# pattern I told my code to look for within each page of the pdf
sqft_re = re.compile(r'(\d+(sqft)\s+[$]\d+[.]\d+\s+\d{2}/\d{2})')

# this is an example of what the line I want in each page looks like:
'1600sqft $154.98 10/14'
Basically I want the code to parse through every PDF and return the line if it can find it. If it cannot, I want it to return a single 'None' for said page. I collect the lines in a list like so:
lines = []
Here is how I set my for loop to look through each page of my pdf files:
for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(files) as pdf:
            pages = pdf.pages
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line.group(1)
                        lines.append(line)
Example of output:
lines
'1600sqft $154.98 10/14'
'1450qft $113.02 07/05'
'90sqft $60.17 05/12'
'3000sqft $500.98 09/20'
This code successfully returns the list of data for pages with the line; however, pages without the line are omitted. Here is what I thought would fix the problem and simply print 'None' for pages without the line:
for files in os.listdir(directory):
    if files.endswith(".pdf"):
        with pdfplumber.open(files) as pdf:
            pages = pdf.pages
            for page in pdf.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    line = sqft_re.search(line)
                    if line:
                        line.group(1)
                    else:
                        line = 'None'
                    lines.append(line)
However, this did not work: instead of substituting a single 'None' for pages without the value, every line within the PDF page that does not match is appended as 'None'. So I now have a list that looks like this:
lines
'None'
'None'
'None'
'1600sqft $154.98 10/14'
'None'
'None'
'None'
'1450qft $113.02 07/05' #etc.....
I have tried some other things, like calling a different function when there is no match and substituting my own string for the value, and a couple more, but I am still getting the same problem. In my sample PDF there is only one page without this line, so my list should look like:
'1600sqft $154.98 10/14'
'1450qft $113.02 07/05'
'90sqft $60.17 05/12'
'3000sqft $500.98 09/20'
'None'
I am also pretty new to Python (R is what I primarily work with), so I am sure I am overlooking something here, but any guidance on what I am missing would be appreciated!
You should append the match to the lines variable, not the line itself, unless that is your intention.
Besides, you need to set a flag to False before checking each page and once there is a match, set it to True. If it is False at the end of the page, add None to the lines.
Here is sample Python code for the loop:
for page in pdf.pages:
    text = page.extract_text()
    found = False
    for line in text.split('\n'):
        match = sqft_re.search(line)
        if match:
            found = True
            lines.append(match.group(1))
    if not found:
        lines.append('None')
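An alternative sketch that avoids the flag altogether: search each page's full text once (assuming at most one such line per page matters). The slightly simplified pattern and function name are my own:

```python
import re

sqft_re = re.compile(r'(\d+sqft\s+\$\d+\.\d+\s+\d{2}/\d{2})')

def match_per_page(page_texts):
    """Return the first sqft line per page, or 'None' when absent."""
    lines = []
    for text in page_texts:
        match = sqft_re.search(text)
        lines.append(match.group(1) if match else 'None')
    return lines

# Example: two page texts, only the first containing the line.
match_per_page(["header\n1600sqft $154.98 10/14\nfooter", "no data here"])
# -> ['1600sqft $154.98 10/14', 'None']
```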
I am a beginner in Python and I am using it for my master's thesis, so I don't know that much. I have a bunch of annual reports (in txt format) and I want to select all the text between "ITEM1." and "ITEM2.". I am using the re package. My problem is that sometimes, in those 10-Ks, there is a section called "ITEM1A.". I want the code to recognize this, stop at "ITEM1A.", and put in the output the text between "ITEM1." and "ITEM1A.". In the code I attached to this post, I tried to make it stop at "ITEM1A.", but it does not; it continues further because "ITEM1A." appears multiple times throughout the file. It would be ideal to make it stop at the first one it sees. The code is the following:
import os
import re

# path to where the 10-Ks are
saved_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/saved files/"
# path to where to save the txt with the selected text between ITEM 1 and ITEM 2
selected_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/10k_select/"
# get a list of all the items in that specific folder and put it in a variable
list_txt = os.listdir(saved_path)

for text in list_txt:
    file_path = saved_path + text
    file = open(file_path, "r+", encoding="utf-8")
    file_read = file.read()
    # looking between ITEM 1 and ITEM 2
    res = re.search(r'(ITEM[\s\S]*1\.[\w\W]*)(ITEM+[\s\S]*1A\.)', file_read)
    item_text_section = res.group(1)
    saved_file = open(selected_path + text, "w+", encoding="utf-8")  # save the file with the complete name
    saved_file.write(item_text_section)  # write the selected text to the new text file
    saved_file.close()  # close the file
    print(text)  # show the progress
    file.close()
If anyone has any suggestions on how to tackle this, it would be great. Thank you!
Try the following regex:
ITEM1\.([\s\S]*?)ITEM1A\.
Adding the question mark makes the quantifier non-greedy, so it will stop at the first occurrence.
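A quick illustration of the difference (the sample text is made up):

```python
import re

# Hypothetical 10-K text in which "ITEM1A." appears twice.
text = "ITEM1. Business text here ITEM1A. Risk factors ITEM1A. repeated"

greedy = re.search(r'ITEM1\.([\s\S]*)ITEM1A\.', text)
lazy = re.search(r'ITEM1\.([\s\S]*?)ITEM1A\.', text)

greedy.group(1)  # runs up to the LAST "ITEM1A."
lazy.group(1)    # " Business text here " -- stops at the first "ITEM1A."
```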
I'm trying to write a program for data extraction from PDFs in Python (an Excel macro could be an option).
First I want to select a text or a position in a PDF file and generate a local path/link to that file at that position. This link will be copied to an Excel cell. When I click on the link, the PDF document should open at the specified coordinates of the previously selected text.
I know the question is very broad. I'm an enthusiastic beginner and need a nudge in the right direction, and to know if it is possible.
How can I get the path of the active PDF file on the desktop, and the coordinates of the selected text? I could then pass these automatically as parameters to my program.
Thank you !
There are a lot of ways to do this; I would say look into Slate --> https://pypi.python.org/pypi/slate , or http://www.unixuser.org/~euske/python/pdfminer/index.html
And yes, it's quite easy; also look into pyPdf:
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
Here is the code I have so far (it is working and extracting text as it should):
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")
I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.
Matt
Take a look at os.walk()
for loop to get it to run on all PDF's in a directory: look at the glob module
save the text as a CSV: look at the csv module
count the pictures: look at the pyPDF module :-)
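A minimal sketch tying those suggestions together; the directory layout is assumed, and `extract` stands in for the question's `getPDFContent`:

```python
import csv
import glob
import os

def pdf_texts_to_csv(pdf_dir, csv_path, extract):
    """Write one (filename, text) CSV row per PDF in pdf_dir.

    `extract` is any callable taking a path and returning the text,
    e.g. the question's getPDFContent.
    """
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "text"])
        for pdf_path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
            writer.writerow([os.path.basename(pdf_path), extract(pdf_path)])
```

This is Python 3 style (`newline=""` keeps the csv module from doubling line endings); under Python 2 you would open the file in binary mode instead.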
Two comments on this statement:
content = " ".join(content.replace(u"\xa0", " ").strip().split())
(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split()
(2) Using strip() is redundant:
>>> u" foo bar ".split()
[u'foo', u'bar']
>>>
The glob module can help you find all files in a single directory that match a wildcard pattern.
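For example (the directory is the one from the question; adjust as needed):

```python
import glob

# Matches every PDF directly inside the directory (non-recursive).
for path in sorted(glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf")):
    print(path)
```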