Python scraping an unstructured PDF

We get bi-weekly software releases from a supplier who provides us with PDF release notes. The notes contain a lot of irrelevant material, but ultimately we need to manually copy/paste information from them into a Confluence page.
Ideally I would like to write a Python app to scrape certain sections out of the PDF. The structure is pretty much as follows (with the bold parts being the ones I want to extract):
1. Introduction
2. New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X
description
3. Defect fixes
description
table with defect descriptions
(rest of the document is irrelevant in this case)
I have managed to import the file and extract all of the text, but I really have no idea how to extract only the headings for section 2, and then for section 3 take only the table and reformat it with pandas. Any suggestions on how to go about this?
import os
import fitz  # PyMuPDF

filename = os.path.expanduser('~/releasenotes.pdf')  # expand '~'; avoid backslashes like '\r', which Python treats as escapes
doc = fitz.open(filename)
print(doc)  # Just to see what comes out
(and now what should I do next?)

A simple regex (regular expression) should do the trick here. I'm making some big assumptions about what the text looks like when it comes out of your PDF read - I have copied the text from your post and called it "doc" per your question :)
import re  # regular expression library
import pandas as pd
doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
'''
ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', doc))
Let me unpack that last line:
re.findall will produce a list of the items in your document that match the search pattern.
r'2\.[1-9].*\n' will find every instance of a 2. (the dot is escaped with a backslash so it matches a literal full stop), followed by any digit from 1-9, followed by any number of characters .* until it reaches a line break \n.
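For the fuller flow, here is a minimal sketch under some assumptions: the section titles "New Features" and "Defect fixes" appear verbatim in the extracted text, and the filename and heading pattern are illustrative, not taken from your actual notes:

import re
import fitz  # PyMuPDF
import pandas as pd

pdf = fitz.open('releasenotes.pdf')  # illustrative filename
text = ''.join(page.get_text() for page in pdf)

# Keep only the text between the two section titles, then pull the headings.
features = text.split('New Features', 1)[1].split('Defect fixes', 1)[0]
headings = pd.Series(re.findall(r'2\.\w+[.)]?\s.*', features))
print(headings)

For the defect table, recent PyMuPDF releases (1.23+) also offer page.find_tables(), and each found table has a to_pandas() method that returns a DataFrame you can clean up before pushing to Confluence.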
Hope this fits the bill?


Create word lists from medical journals

I have been asked to compile a crossword for a surgeon's publication; it comes out quarterly. I need to make it medically oriented, preferably using words from different specialties, e.g. some orthopaedics, some cardiac surgery and some human anatomy, etc.
I can get surgical journals online.
I want to create word lists for each specialty and use them in Crossword Compiler.
I can use journal articles on the web, or downloaded PDFs. I am a surgeon and use pandas for data analysis, but my Python skills are a bit primitive, so I need relatively simple solutions. How can I create the specific word lists for each surgical specialty?
They don't need to be very specific words, so, for example, I thought I could scrape the journal volume for words, compare them to a list of common words and delete those, leaving me with a technical list. It may require some trial and error. I haven't used Beautiful Soup before but am willing to try it.
Alternatively, I could skip the Beautiful Soup step and use EndNote to download a few hundred journal articles and export them to txt.
It's the extraction and the list-making that I am mainly struggling to conceptualise.
I created this program that you can use to parse a .txt file and find the most common words. I also included a block of code that will help you convert a .pdf file to .txt. Hope my approach to the solution helps - good luck with your crossword for the surgeon's publication!
'''
Find the most common words in a txt file
'''
import collections
# The re module provides regular expression matching operations
import re
'''
Use this if you would like to convert a PDF to a txt file
'''
# import PyPDF2
# pdffileobj = open('textFileName.pdf', 'rb')
# pdfreader = PyPDF2.PdfFileReader(pdffileobj)
# text = ''
# for page_num in range(pdfreader.numPages):  # walk every page, not just the last one
#     text += pdfreader.getPage(page_num).extractText()
# file1 = open(r"(folder path)\textFileName.txt", "a")
# file1.writelines(text)
# file1.close()
# Tokenize the text into lower-cased words, then count them.
words = re.findall(r'\w+', open('textFileName.txt').read().lower())
most_common = collections.Counter(words).most_common(10)
print(most_common)
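Since the goal is really the opposite of "most common" - keeping the technical leftovers - here is a rough sketch of the subtract-a-common-word-list idea from the question; common_words.txt is a hypothetical file of everyday English words, one per line:

import re

# Hypothetical list of everyday English words, one per line.
with open('common_words.txt') as f:
    common = {line.strip().lower() for line in f}

# Every distinct word in the journal text...
words = set(re.findall(r'[a-z]+', open('textFileName.txt').read().lower()))

# ...minus the everyday ones, keeping reasonably long candidates.
technical = sorted(w for w in words - common if len(w) > 3)

with open('specialty_words.txt', 'w') as out:
    out.write('\n'.join(technical))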

MS Word document structure, COM calls and Python

I am using comtypes to call functions and create an MS Word document. This is the first time I have written such a program, and there is something I don't understand. What I want to do is:
Create sections in the document and call them A, B, ...
In each section create paragraphs that contain text. For section A, call the paragraphs a1, a2, a3, ...
Add formatting to each paragraph in each section; the formatting may be different for each paragraph.
Below are some code fragments in VBA. VBA is used since the translation to comtypes is almost direct and there are more examples on the net for VBA.
Set myRange = ActiveDocument.Range(Start:= ...., End:= ...) 'start and end can be anything
ActiveDocument.Sections.Add Range:=myRange 'Section A
Set newRange = ActiveDocument.Range(Start:= ...., End:= ...) 'start and end can be anything
newRange.Paragraphs.Add
I get stuck selecting paragraph a1 and setting its text. What is missing for me is a function that, say, gets the collection of paragraphs in section A.
The following VBA, based on the code in the question, illustrates getting a Document object, adding a Section, getting the Paragraphs of that Section, getting the Paragraphs of any given Section in a document, getting the first or any Paragraph from a Paragraphs collection.
Set doc = ActiveDocument 'More efficient if the Document object will be used more than once
Set section1 = doc.Sections.Add(Range:=myRange) 'Section A | Type Word.Section
Set section1Paras = section1.Paragraphs 'Type Word.Paragraphs
'OR
Set sectionParas = doc.Sections(n).Paragraphs 'where n = section index number
Set para = sectionParas.First 'OR =sectionParas(n) where n = paragraph index number
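Since the question is ultimately about comtypes, here is a hedged Python sketch of the same calls, assuming Word is installed; it simply mirrors the Word COM object model shown in the VBA above, and the range positions and text are illustrative:

import comtypes.client

word = comtypes.client.CreateObject('Word.Application')
word.Visible = True
doc = word.Documents.Add()

# Add a section at the end of the document (section A).
my_range = doc.Range(doc.Content.End - 1, doc.Content.End - 1)
section_a = doc.Sections.Add(my_range)

# Get the collection of paragraphs in section A and write into a1.
a1 = section_a.Paragraphs.First
a1.Range.Text = 'text for paragraph a1'
a1.Range.Font.Bold = True  # per-paragraph formatting goes through the Range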

How to restore a split word by removing the hyphen "-" introduced by hyphenation in a paragraph, using Python

simple example: func-tional --> functional
The story is that I got a Microsoft Word document that was converted from PDF format, and some words remain hyphenated (such as func-tional, broken because of a line break in the PDF). I want to recover those broken words while leaving normal ones (i.e., where "-" is not a word break) intact.
In order to make it more clear, one long example (source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script searches for hyphenated words and replaces the hyphen with nothing.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find combinations of word-word
    sOut = s
    for m in matchList:
        new = m.replace("-", "")
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to RegExp I recommend this site:
https://regex101.com/
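One caveat: this also collapses legitimate hyphens such as "Drug-eluting". A stricter variant, sketched below under the assumption that you have some vocabulary to check against (english_words here is a tiny hypothetical stand-in for a real word list), only merges when the joined form is a known word:

import re

english_words = {'functional', 'compendium'}  # hypothetical stand-in vocabulary

def replace_hyphenated(s, vocab):
    def maybe_join(m):
        joined = m.group(0).replace('-', '')
        # Only merge when the de-hyphenated form is a known word.
        return joined if joined.lower() in vocab else m.group(0)
    return re.sub(r'\w+-\w+', maybe_join, s)

print(replace_hyphenated('the Func-tional check of drug-eluting stents', english_words))
# -> the Functional check of drug-eluting stents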

Search methods and string matching in python

I have a task to search for a group of specific terms (around 138,000 terms) in a table made of 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column might contain more than one term.
I should end up with a CSV table with the id where a term has been found and the term itself. What would be the best and fastest way to do so?
In my script, I tried creating phrases by iterating over the words of a term in order and comparing each phrase with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive - it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, then make two changes.
After:
for line in text:
add:
    line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
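A minimal sketch of the combined effect, assuming literalhash maps lower-cased terms to something truthy; it also scans every word window of the line, not only prefixes, which is what the posted loop was missing:

literalhash = {'heart disease': True}  # hypothetical coded dictionary

line = 'Ischemic Heart Disease After Drug-eluting Stent Implantation'
line = line.lower()

sentence_array = line.split(' ')
for start in range(len(sentence_array)):
    for end in range(start + 1, len(sentence_array) + 1):
        phrase = ' '.join(sentence_array[start:end])
        if phrase in literalhash:
            print('matched:', phrase)  # -> matched: heart disease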
To clarify the problem: we are running a small scientific project where we need to extract all text parts with particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with a little knowledge of Python!
If you need some more code, I will post it online.

Download the posts from a Google group before resetting

I want to reset one of my groups (a class discussion), but I would like to retain the discussion for reference. There aren't many posts (maybe 50), and I could do it by hand, but is there a way to do it through Google Apps Script or Python?
I found a few possibilities, but none in a language I'm familiar with (though I might be able to translate):
this link: http://saturnboy.com/2010/03/scraping-google-groups/
this Perl code:
#!/usr/bin/perl
# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel.
# Usage: perl groups2csv.pl < groups.html > groups.csv
# The CSV Header.
print qq{"title","url","group","date","author","number of articles"\n};
# The base URL for Google Groups.
my $url = "http://groups.google.com";
# Rake in those results.
my($results) = (join '', <>);
# Perform a regular expression match to glean individual results.
while ( $results =~ m!<a href=(/groups[^\>]+?rnum=[0-9]+)>(.+?)</a>.*?
<br>(.+?)<br>.*?<a href="?/groups.+?class=a>(.+?)</a> - (.+?) by
(.+?)\s+.*?\(([0-9]+) article!mgis ) {
my($path, $title, $snippet, $group, $date, $author, $articles) =
($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
$title =~ s!"!""!g; # double escape " marks
$title =~ s!<.+?>!!g; # drop all HTML tags
print qq{"$title","$url$path","$group","$date","$author","$articles"\n\n};
}
Take a look at the HTTrack utility mentioned in this webapps question and in this forum discussion.
Note I'm assuming you don't actually want to screen scrape and process data, but merely want a copy of the discussion for future reference.
EDIT: If you actually want to screen scrape, you can do this too, but writing a script to do it can be a significant time sink. Screen scraping is more about extracting specific pieces of data from an HTML document than it is about grabbing the entire HTML document. An example where you might need to screen scrape would be if you were looking at the Jeopardy website and wanted to grab individual questions, their point values, who answered them right, which game they occurred in, etc., for insertion into a database.
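To illustrate the distinction in Python (the URL is purely illustrative, not Google-Groups-specific, and modern Google Groups pages may require a logged-in browser session):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://example.com/').read()

# "Keep a copy for reference": save the raw document as-is.
with open('page.html', 'wb') as f:
    f.write(html)

# "Screen scrape": extract specific pieces, e.g. every link's text and URL.
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print(a.get_text(strip=True), a.get('href'))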
