How to scrape data from PDF into Excel - python

I am trying to scrape data from a PDF and save it into an Excel file. This is the PDF I need: https://www.medicaljournals.se/acta/content_files/files/pdf/98/219/Suppl219.pdf
However, I don't need all of the data, only the following, saved to Excel in separate cells:
From page 5, for each entry starting at P001 up to and including the Introduction: the P number, the title, the author names, and the Introduction.
For now, I can only convert the PDF file into text (my code below) and save it all in one cell, but I need it split into separate cells.
import PyPDF2 as p2

PDFfile = open('Abstract Book from the 5th World Psoriasis and Psoriatic Arthritis Conference 2018.pdf', 'rb')
pdfread = p2.PdfFileReader(PDFfile)
pdflist = []
i = 6
while i < pdfread.getNumPages():
    pageinfo = pdfread.getPage(i)
    #print(pageinfo.extractText())
    i = i + 1
    pdflist.append(pageinfo.extractText().replace('\n', ''))
print(pdflist)

The main things you need are a 'header' regex (15 or more uppercase letters) and an 'article' regex (the letter 'P' followed by 3 digits).
One more regex helps you divide the text by any of the keywords:
import re
import PyPDF2

article_re = re.compile(r'[P]\d{3}')  # P001: letter 'P' and 3 digits
header_re = re.compile(r'[A-Z\s\-]{15,}|$')  # min 15 UPPERCASE letters, also allowing '\n' and '-'
key_word_delimiters = ['Peoples', 'Introduction', 'Objectives', 'Methods', 'Results', 'Conclusions', 'References']

file = open('data.pdf', 'rb')
pdf = PyPDF2.PdfFileReader(file)

text = ''
for i in range(6, 63):
    text += pdf.getPage(i).extractText()  # all text in one variable

articles = []
for article in re.split(article_re, text):
    header = re.match(header_re, article)  # receiving a match
    other_text = re.split(header_re, article)[1]  # receiving the other text
    if header:
        header = header.group()  # get text from match
        item = {'header': header}
        first_name_letter = header[-1]  # save the first letter of the name to put it in the right position. Some kind of HOT BUGFIX
        header = header[:-1]  # cut the last character: the first letter of the name
        header = header.replace('\n', '')  # delete line breaks
        header = header.replace('-', '')  # delete line-break hyphens
        other_text = first_name_letter + other_text
        data_array = re.split(
            'Introduction:|Objectives:|Methods:|Results:|Conclusions:|References:',
            other_text)
        for key, data in zip(key_word_delimiters, data_array):
            item[key] = data.replace('\n', '')
        articles.append(item)
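
The question also asks for the result to be saved to Excel with each field in its own cell. A minimal sketch of that final step, assuming the articles list built above and that pandas (with openpyxl as its Excel writer) is installed:

import pandas as pd

# one row per abstract, one column per key ('header', 'Peoples', 'Introduction', ...)
df = pd.DataFrame(articles)
df.to_excel('abstracts.xlsx', index=False)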

Related

Extract paragraphs instead of sentences with Python Tika

I've been trying different things to solve my issue, but nothing seems to be working - I need your help!
I found code that is quite effective at parsing a PDF (e.g. an annual report) and extracting sentences from it. I was, however, wondering how I could extract paragraphs instead of sentences.
My intuition would be to split on the double newline, i.e. "\n\n", which could indicate the start of a new paragraph, but I haven't quite managed to get it to work. Does anyone have any tips?
#PDF parsing
import re
import string

import nltk
from tika import parser

class parsePDF:
    def __init__(self, url):
        self.url = url

    def extract_contents(self):
        """ Extract a pdf's contents using tika. """
        pdf = parser.from_file(self.url)
        self.text = pdf["content"]
        return self.text

    def clean_text(self):
        """ Extract & clean sentences from raw text of pdf. """
        # Remove non ASCII characters
        printables = set(string.printable)
        self.text = "".join(filter(lambda x: x in printables, self.text))
        # Replace tabs with spaces
        self.text = re.sub(r"\t+", r" ", self.text)
        # Aggregate lines where the sentence wraps
        # Also, a line in CAPITALS is counted as a header
        fragments = []
        prev = ""
        for line in re.split(r"\n+", self.text):
            if line.isupper():
                prev = "."  # skip it
            elif line and (line.startswith(" ") or line[0].islower()
                           or not prev.endswith(".")):
                prev = f"{prev} {line}"  # make into one line
            else:
                fragments.append(prev)
                prev = line
        fragments.append(prev)
        # Clean the lines into sentences
        sentences = []
        for line in fragments:
            # Use regular expressions to clean text
            url_str = (r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\."
                       r"([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:#\-_=#])*")
            line = re.sub(url_str, r" ", line)  # URLs
            line = re.sub(r"^\s?\d+(.*)$", r"\1", line)  # headers
            line = re.sub(r"\d{5,}", r" ", line)  # figures
            line = re.sub(r"\.+", ".", line)  # multiple periods
            line = line.strip()  # leading & trailing spaces
            line = re.sub(r"\s+", " ", line)  # multiple spaces
            line = re.sub(r"\s?([,:;\.])", r"\1", line)  # punctuation spaces
            line = re.sub(r"\s?-\s?", "-", line)  # split-line words
            # Use nltk to split the line into sentences
            for sentence in nltk.sent_tokenize(line):
                s = str(sentence).strip().lower()  # lower case
                # Exclude tables of contents and short sentences
                if "table of contents" not in s and len(s) > 5:
                    sentences.append(s)
        return sentences
And this is what you call afterwards:
url = "https://www.aperam.com/sites/default/files/documents/Annual_Report_2021.pdf"  # any PDF URL you want
pp = parsePDF(url)
pp.extract_contents()
sentences = pp.clean_text()
All recommendations are greatly appreciated!
PS: If anyone has a better solution already created, I'd be more than happy to have a look at it!
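
One way to sketch that "\n\n" intuition (this assumes Tika keeps blank lines between paragraphs in pdf["content"], which it generally does for text-based PDFs):

from tika import parser
import re

def extract_paragraphs(url):
    """Split tika's raw text on blank lines instead of single newlines (a sketch)."""
    text = parser.from_file(url)["content"] or ""
    # Split wherever there are two or more consecutive newlines (possibly with spaces between them)
    raw_paragraphs = re.split(r"\n\s*\n+", text)
    # Re-join wrapped lines inside each paragraph and drop empty fragments
    return [re.sub(r"\s+", " ", p).strip() for p in raw_paragraphs if p.strip()]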

Finding a piece of information in a document and deleting everything before and after

I have some .docx files that are very specifically formatted.
I have copied the file 5 times, one copy for each of the 5 different strings that I need to be "found", with everything else removed.
#! python 3
import docx
import os
import shutil
import readDocx as rD

def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None

#Select the file you want to work with
fP = rD.file
#get the working directory for the file
nfP = os.path.dirname(os.path.abspath(fP))
#print (nfP)
#Break the filepath into parts
fileSplit = fP.split('/')
#Get the filename only
fileCode = fileSplit[-1]
#print (fileCode)
#Separate the course code
nameSplit = fileCode.split(' ')
courseCode = nameSplit[0]
#print (courseCode)

#List of files that we need to create
a1 = "Assessment Summary"
a2 = "Back to Business project"
a3 = "Back to Business Checklist"
a4 = "Skills Demonstration"
a5 = "Skills Demonstration Checklist"
names = [a1, a2, a3, a4, a5]

#Creates a list for the new filenames to sit in
newFiles = []
#Creates the files from the original
for name in names:
    fileName = os.path.join(nfP + '\\' + courseCode + ' ' + str(name) + ' ' + 'Version 1.0' + '.docx')
    shutil.copy(fP, fileName)
    #print(fileName)
    newFiles.append(fileName)
#print (newFiles)

#Need to iterate through the files and start deleting data.
h1 = "Learner Declaration"
h2 = "Back to Business Project"
h3 = "Assessor Observation Checklist / Marking Guide"
h4 = "Skills Demonstration"
h5 = "Assessor Observation Checklist / Marking Guide"
This is where my limited skill starts to fail me. The h1-h5 tags represent the headings of the document pieces that I want to keep.
How can I iterate through the document, find a heading, and delete everything before / after those paragraphs?
I don't necessarily need the answer, just more of a "look in this direction".
Thanks
Try this. The comments explain clearly what the code does.
from docx import Document  # Package "python-docx" needs to be installed to import this
import pandas as pd

# Read the document into a python-docx Document object
document = Document('Path/to/your/input/.docx/document')

# Initialize an empty dataframe to store the .docx document along with the style of each paragraph
document_text_dataframe = pd.DataFrame(columns=['para_text', 'style'])

# Iterate through the "document" object, extracting each paragraph's text and style into "document_text_dataframe"
for para in document.paragraphs:
    # Extract paragraph style
    style = str(para.style.name)
    ##### Headings which are created with the NORMAL style but are BOLD need to be extracted as well -
    ##### ideally these represent headings too.
    runboldtext = ''
    for run in para.runs:
        if run.bold:
            runboldtext = runboldtext + run.text
    if runboldtext == str(para.text) and runboldtext != '':
        print("Bold True for:", runboldtext)
        style = 'Heading'
    #################################################################
    dftemp = pd.DataFrame({'para_text': [para.text], 'style': [style]})
    # Append each paragraph along with its style to "document_text_dataframe"
    # (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
    document_text_dataframe = pd.concat([document_text_dataframe, dftemp], sort=False)

document_text_dataframe = document_text_dataframe.reset_index(drop=True)

#Need to iterate through the files and start deleting data.
h1 = "Learner Declaration"
h2 = "Back to Business Project"
h3 = "Assessor Observation Checklist / Marking Guide"
h4 = "Skills Demonstration"
h5 = "Assessor Observation Checklist / Marking Guide"
h_list = [h1, h2, h3, h4]

# Initialize a list to store the extracted information relevant to each "h" value
extracted_content = []
for h in h_list:
    df_temp = pd.DataFrame(columns=['para_text', 'style'])
    ########## Loop through the document to extract the content related to each "h" value ##########
    start_index = 0
    end_index = 0
    for index, row in document_text_dataframe.iterrows():
        if h == row['para_text']:
            print("Found match in document for: ", h)
            start_index = index
            print("Matching index=", index)
            break
    if start_index != 0:
        for i in range(start_index + 1, len(document_text_dataframe) - 1):
            if 'Heading' in document_text_dataframe.loc[i, 'style']:
                end_index = i
                break
    if end_index != 0:
        for i in range(start_index, end_index):
            df_temp = pd.concat([df_temp, document_text_dataframe.loc[[i]]])
    ################################################################################################
    # Append every extracted content dataframe to the list "extracted_content"
    if start_index != 0 and end_index != 0:
        extracted_content.append(df_temp)

# The list "extracted_content" will consist of dataframes. Each dataframe corresponds to the extracted information for one "h" value.
print(extracted_content)
Now, using extracted_content, you can write every entry in the list extracted_content to a separate .docx document using your code.
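For illustration, a minimal sketch of that last step (assuming the extracted_content list built above and python-docx; the output filenames here are made up):

from docx import Document

for idx, df in enumerate(extracted_content):
    out_doc = Document()
    for text in df['para_text']:
        out_doc.add_paragraph(text)  # one paragraph per extracted row
    out_doc.save(f'extracted_section_{idx}.docx')  # hypothetical output filename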
Cheers!

split() issues with pdf extractText()

I'm working on a minor content analysis program that I was hoping to run through several PDF files and return the total frequency with which certain specific words are mentioned in the text. The words to search for are specified in a separate text file (list.txt) and can be altered. The program runs just fine through files in .txt format, but the result is completely different when running the program on a .pdf file. To illustrate, the test text that I have the program running through is the following:
"Hello
This is a product development notice
We’re working with innovative measures
A nice Innovation
The world that we live in is innovative
We are currently working on a new process
And in the fall, you will experience our new product development introduction"
The list of words grouped in categories are the following (marked in .txt file with ">>"):
innovation: innovat
product: Product, development, introduction
organization: Process
Running the code on a .txt file gives the expected counts, whereas running it on a .pdf does not.
My issue pertains to the splitting of the words: in the .pdf output a string like "world" can end up split into 'w', 'o', 'rld'. I have searched tirelessly for why this happens, without success. As I am rather new to Python programming, I would appreciate any answer, or a direction to where I can find an answer to why this happens, should you know of any source.
Thanks
The code for the .txt is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()
dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.txt'
    textfile = open(f)
    text = textfile.read().split()  # lowercase the text
    print(text)
    textfile.close()
    i = i + 1

# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0

# import the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in text:
    for pattern in dic.keys():
        if pattern.match(token):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

print(os.path.basename(f))
for key in scores.keys():
    print(key, ":", scores[key])
While the code for the .pdf is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()
dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.pdf'
    textfile = open(f, 'rb')
    text = PyPDF2.PdfFileReader(textfile)  # lowercase the text
    for pageNum in range(0, text.numPages):
        texts = text.getPage(pageNum)
        textfile = texts.extractText().split()
        print(textfile)
    i = i + 1

# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0

# import the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in textfile:
    for pattern in dic.keys():
        if pattern.match(token):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

print(os.path.basename(f))
for key in scores.keys():
    print(key, ":", scores[key])
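
One possible workaround, as a sketch only: the symptom is consistent with extractText() inserting spurious line breaks inside words, which old PyPDF2 versions often do. Instead of matching token by token, join the page text back together and count matches with findall over the whole text, reusing the dic mapping built from list.txt above:

import re
import PyPDF2

pdf = PyPDF2.PdfFileReader(open('annual_report_2011.pdf', 'rb'))
page_text = ''
for pageNum in range(pdf.numPages):
    # join the raw text and drop the line breaks that land in the middle of words
    page_text += pdf.getPage(pageNum).extractText().replace('\n', '')

# count matches per category over the whole text instead of per token
scores = {}
for pattern, categ in dic.items():  # "dic" is the pattern -> category mapping built from list.txt
    scores[categ] = scores.get(categ, 0) + len(pattern.findall(page_text))
print(scores)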

If [string] in [string] not working for requested webpage text

I am trying to open a webpage and scrape some strings from it into a list. The list would ultimately be populated by all of the names displayed on the webpage. In trying to do so, my code looks like this:
import xlsxwriter, urllib.request, string, http.cookiejar, requests

def main():
    username = 'john.mauran'
    password = 'fZSUME1q'
    log_url = 'https://aries.case.com.pl/'
    dest_url = 'https://aries.case.com.pl/main_odczyt.php?strona=eksperci'
    login_values = {'username': username, 'password': password}
    r = requests.post(dest_url, data=login_values, verify=False, allow_redirects=False)
    open_sesame = r.text
    #reads the expert page
    readpage_list = open_sesame.splitlines()
    #opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    #adds worksheet to file
    worksheet = workbook.add_worksheet()
    #initializing the variable used to move names and dates
    #in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    #initializing expert attribute variables and lists
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    #this loop goes through and finds all the lines
    #that contain the expert URL and name and saves them to raw_list:
    #raw_list loop
    for i in open_sesame:
        if '<tr><td align=left><a href=' in i:
            raw_list += i
    if not raw_list:
        print("List is empty")
    if raw_list:
        print(raw_list)

main()
As you can see, all I want to do is take the lines from the text returned by the requests operation which start with the following characters: '<tr><td align=left><a href='.
I don't know exactly what you're trying to do, but this doesn't make any sense:
for i in open_sesame:
    if '<tr><td align=left><a href=' in i:
        raw_list += i
First of all, if you iterate over open_sesame, which is a string, each item in the iteration will be a character in the string. Then '<tr><td align=left><a href=' in i will always be false.
Second of all, raw_list += i is not how you append an item to a list.
Finally, why is the variable called open_sesame? Is it a joke?
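A minimal sketch of the corrected loop (assuming open_sesame holds the page HTML returned by requests, as in the question):

raw_list = []
for line in open_sesame.splitlines():           # iterate over lines, not characters
    if '<tr><td align=left><a href=' in line:   # the substring test now checks a whole line
        raw_list.append(line)                   # append() adds the matching line as one item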

Excel List iteration

I am working on a text search project. I have 2 lists.
a = ['ibm','dell']
b =['strength','keyword']##this is a list of keywords given by the user
Now I create combinations for searching Google:
lst = list(itertools.product(a, b))
What I need help with is below:
Using the code, I will search for the text using different keywords and their lemmas. After that I need to write the searched text to an Excel file. I need to create worksheets named after the entries in list a and write only the searched text into the different worksheets. I am not able to figure this out. Below is part of my code.
def getarticle(url, n):
    final = []
    regex = '(.*).pdf'
    pattern = re.compile(regex)
    if re.match(pattern, url) is not None:
        text = pdf_to_text(url)
        final.append('')
        final.append(url)
        final.append(text)
        New_file = open('text' + str(round(random.random(), 2)) + '.txt', 'w+')
        New_file.write(smart_str(unicode(text, 'utf-8')))
        New_file.close()
    else:
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.addheaders = [('User-agent', 'Chrome')]
        html = br.open(url).read()
        titles = br.title()
        readable_article = Document(html).summary()
        readable_title = Document(html).short_title()
        soup = bs4.BeautifulSoup(readable_article)
        Final_Article = soup.text
        final.append(titles)
        final.append(url)
        final.append(Final_Article)
        raw = nltk.clean_html(html)
        cleaned = re.sub(r'& ?(ld|rd)quo ?[;\]]', '\"', raw)
        tokens = nltk.wordpunct_tokenize(raw)
        lmtzr = WordNetLemmatizer()
        t = [lmtzr.lemmatize(t) for t in tokens]
        text = nltk.Text(t)
        word = words(n)
        find = ' '.join(str(e) for e in word)
        search_words = set(find.split(' '))
        sents = ' '.join([s.lower() for s in text])
        blob = TextBlob(sents.decode('ascii', 'ignore'))
        matches = [map(str, blob.sentences[i-1:i+2])  # from previous sentence to the one after next
                   for i, s in enumerate(blob.sentences)  # i is index, s is element
                   if search_words & set(s.words)]
        return ''.join(str(y).replace('& rdquo', '').replace('& rsquo', '') for y in matches)
This returns the text; now I need to write it to Excel files, which I am unable to code.
As far as writing text out to a file Excel can read is concerned, you might want to look at Python's csv library, which provides lots of useful .csv manipulation tools.
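
If you do want actual worksheets named after the entries in list a rather than .csv files, here is a minimal sketch with xlsxwriter, assuming the searched text has already been collected for each (name, keyword) pair:

import xlsxwriter

a = ['ibm', 'dell']
b = ['strength', 'keyword']
# placeholder data standing in for whatever getarticle() returns for each pair
results = {(name, kw): 'searched text for %s / %s' % (name, kw) for name in a for kw in b}

workbook = xlsxwriter.Workbook('search_results.xlsx')
for name in a:
    worksheet = workbook.add_worksheet(name)          # one worksheet per entry in list a
    for row, kw in enumerate(b):
        worksheet.write(row, 0, kw)                   # keyword in column A
        worksheet.write(row, 1, results[(name, kw)])  # searched text in column B
workbook.close()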
