I have a mailing reference list in the form of a PDF. The list has a very general format, i.e. the author name followed by the name of the book.
Consider the following examples:
American Reading List
Democratic Theory
• Dahl, Preface to Democratic Theory
• Schumpeter, Capitalism, Socialism, and Democracy (Introduction and part IV only)
• Machperson, Life and Times of Liberal Democracy
• Dahl, Democracy and its Critics
Now I am trying to parse the PDF using pdfminer and create a list where the first index is the author name and the second index is the name of the book, just like this:
[Dahl, Preface to Democratic Theory]
I am trying to use the split functionality because there is a comma and a space after the author name. However, I don't get the correct results.
Can somebody help?
def extract():
    check = []
    string = convert_pdf_to_txt("/Users/../../names.pdf")
    lines = list(filter(bool, string.split('\n')))
    for i in lines:
        check.extend(i.split(','))
    x = remove_numbers(check)
    remove_blank = [x for x in x if x]
    combine_two = [remove_blank[x:x + 2] for x in xrange(0, len(remove_blank), 2)]
    print combine_two
Let's see what's going wrong here. I'm making some guesses, but hopefully they are the relevant ones.
Your convert_pdf_to_txt() function returns a single long string containing all the text of the PDF.
You split the text on ", " which results in a list of strings.
Given your example data, this list looks something like this (each element is on a separate line here):
Dahl
Preface to Democratic Theory(line break)(bullet)(tab)Schumpeter
Capitalism
Socialism
and Democracy (Introduction and part IV only)(line break)(bullet)(tab)Machpherson
Life and Times of Liberal Democracy(line break)(bullet)(tab)Dahl
Democracy and its Critics
Because you split on ", " without regard for the fact that the data is formatted as lines, you end up with stuff from multiple lines in each item.
Now you use filter() to iterate over this list and filter out all the ones that aren't true. A non-empty string is true, and all of the elements are non-empty strings, so all the elements get through. Your filter() therefore doesn't do anything.
What you seem to want is something more like this:
lines = [line.split(", ", 1) for line in string.splitlines() if ", " in line]
Here we first split the string into lines, skip any line that doesn't have a comma-space in it, and return a list of two-element lists by splitting each remaining line on its first comma-space.
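For illustration, here is that one-liner applied to the sample lines from the question; the bullet characters and tabs are an assumption about what convert_pdf_to_txt() returns, and the lstrip() that cleans the author field is my addition:

text = """American Reading List
Democratic Theory
• Dahl, Preface to Democratic Theory
• Schumpeter, Capitalism, Socialism, and Democracy (Introduction and part IV only)
• Machperson, Life and Times of Liberal Democracy
• Dahl, Democracy and its Critics"""

# Keep only the lines with a comma-space and split each on the first one.
lines = [line.split(", ", 1) for line in text.splitlines() if ", " in line]
# Strip the leading bullet/tab/whitespace from the author field.
lines = [[author.lstrip("•\t "), title] for author, title in lines]
print(lines)
# [['Dahl', 'Preface to Democratic Theory'],
#  ['Schumpeter', 'Capitalism, Socialism, and Democracy (Introduction and part IV only)'],
#  ['Machperson', 'Life and Times of Liberal Democracy'],
#  ['Dahl', 'Democracy and its Critics']]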
I have a text file 'Filter.txt' which contains a specific keyword, 'D&O insurance'. I would like to check whether there are numbers in the sentence which contains that keyword, as well as in the sentences immediately before and after it.
For example, I have a long paragraph like this:
"International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered? "
The target word is "D&O insurance." If I wanted to extract the target sentence (D&O insurance grants cover on a claims-made basis.) as well as the preceding and following sentences (Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. and How much is enough?), what would be a good approach?
This is what I'm trying so far. However, I don't really know how to extend it to check the whole sentence and the ones around it.
import re

for line in open('Filter.txt'):
    match = re.search('D&O insurance(\d+)', line)
    if match:
        print match.group(1)
I'm new to programming, so I'm looking for the possible solutions for that purpose.
Thank you for your help!
Okay, I'm going to take a stab at this. Assume string is the entire contents of your .txt file (you may need to clean the '\n's out).
You're going to want to make a list of potential sentence endings, use that list to find the index positions of the sentence endings, and then use THAT list to make a list of the sentences in the file.
string = "International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered?"
endings = ['! ', '? ','. ']
def pos_find(string):
lst = []
for ending in endings:
i = string.find(ending)
if i != -1:
lst.append(string.find(ending))
return min(lst)
def sort_sentences(string):
sentences = []
while True:
try:
i = pos_find(string)
sentences.append(string[0:i+1])
string = string[i+2:]
except ValueError:
sentences.append(string)
break
return sentences
sentences = sort_sentences(string)
Once you have the list of sentences (I got a little weary here, so forgive the spaghetti code - the functionality is there), you will need to comb through that list to find characters that could be integers (this is how I'm checking for numbers... but you COULD do it differently).
for i in range(len(sentences)):
    sentence = sentences[i]
    match = sentence.find('D&O insurance')
    print(match)
    if match >= 0:
        # the previous sentence, the matching sentence, and the next sentence
        lst = [sentences[i - 1], sentence, sentences[i + 1]]
        for j in range(len(lst)):
            sen = lst[j]
            for char in sen:
                try:
                    int(char)
                    print(f'Found {char} in "{sen}", index {j}')
                except ValueError:
                    pass
Note that you will have to make some modifications to capture multi-digit numbers. This will just print something for each digit in the full number (i.e. it will print a statement for 1, 0, and 0 if it finds 100 in the file). You will also need to catch the two edge cases where the D&O insurance substring is found in the first or last sentence: in the code above, sentences[i-1] silently wraps around to the end of the list if the match is in the first sentence, and sentences[i+1] is out of range if it is in the last.
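As a rough sketch of those two fixes (my addition, not part of the code above): re.findall grabs whole numbers such as 100 instead of the individual digits, and the max() guard plus Python's tolerant slicing keep the window inside the list when the keyword falls in the first or last sentence.

import re

for i, sentence in enumerate(sentences):
    if 'D&O insurance' in sentence:
        # one sentence before, the match itself, one sentence after
        window = sentences[max(i - 1, 0): i + 2]
        for sen in window:
            for number in re.findall(r'\d+', sen):
                print('Found {0} in "{1}"'.format(number, sen))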
I am wrangling with a dataset and I have ended up with a list of names of the following form:
s = ['DR. James Coffins',
'Zacharias Pallefas',
'Matthew Ebnel',
'Ranzzith Redly',
'GEORGE GEORGIADAKIS',
'HARISH KUMARAN K',
'Christiaan Kraanlen, CFA',
'Mary K. Lein, CFA, COL',
'Alexandre Cegra, CFA, CAIA'
'Anna Bely']
I must extract the last names and place them in a separate list (or column in a pandas dataframe). However, I am puzzled by the many forms the full names can take, and I am a novice in Python.
A possible algorithm would be the following:
Loop through the elements of the list. For each element:
split the element into subelements using spaces. Then:
a) If there are four or fewer subelements, start from the beginning and examine the first four subelements.
a1) If the first subelement is longer than 2 letters, then: if the second subelement is longer than one letter, return the second subelement. Otherwise, return the third subelement.
a2) If the first subelement is 2 letters, drop it and repeat step a1.
How about always grabbing the second element of each line, after skipping words that contain a '.' or are in an exclude list ['dr', 'mr', 'mrs', 'miss', 'prof']?
>>> exclude_tags = ['dr', 'mr', 'mrs', 'miss', 'prof']
>>> [[y for y in x.split() if '.' not in y and y.lower() not in exclude_tags][1].rstrip(',').capitalize() for x in s]
['Coffins', 'Pallefas', 'Ebnel', 'Redly', 'Georgiadakis', 'Kumaran', 'Kraanlen', 'Lein', 'Cegra']
For anyone else coming across this question, keep in mind that it is impossible in general to perfectly extract a person's surname from their full name, and go read Falsehoods Programmers Believe About Names
Sunitha's solution will fail for anyone whose last name is composed of more than one token (van Gogh), has more than one last name (Gonzalez Ramirez), has a first name that has more than one token (Mary Jane Watson), chose to put their middle name in whatever system created this list, is from an Asian culture where the order of given name / surname is sometimes reversed, etc.
I have a list of strings and I want to find popular prefixes. The prefixes are special in that they occur as strings in the input list.
I found a similar question here but the answers are geared to find the one most common prefix:
Find *most* common prefix of strings - a better way?
While my problem is similar, it differs in that I need to find all popular prefixes. Or to maybe state it a little simplistically, rank prefixes from most common to least.
As an example, consider the following list of strings:
in, india, indian, indian flag, bull, bully, bullshit
Prefix ranking:
in - 4 times
india - 3 times
bull - 3 times
...and so on. Please note - in, bull, india are all present in the input list.
The following are not valid prefixes:
ind
bu
bul
...since they do not occur in the input list.
What data structure should I be looking at to model my solution? I'm inclined to use a "trie" with a counter on each node that tracks how many times that node has been touched during the creation of the trie.
All suggestions are welcome.
Thanks.
p.s. - I love python and would love if someone could post a quick snippet that could get me started.
words = [ "in", "india", "indian", "indian", "flag", "bull", "bully", "bullshit"]
Result = sorted([ (sum([ w.startswith(prefix) for w in words ]) , prefix ) for prefix in words])[::-1]
It goes through every word as a prefix and checks how many of the words start with it, then sorts the result. The [::-1] simply reverses that order.
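If you want to start from the trie idea in your question, here is a rough sketch along those lines; the dict-of-dicts layout and the count_for() helper are just illustrative names, and only prefixes that occur as strings in the input list are reported.

from collections import defaultdict

words = ["in", "india", "indian", "indian flag", "bull", "bully", "bullshit"]

def make_node():
    return {"count": 0, "children": defaultdict(make_node)}

root = make_node()
for w in words:
    node = root
    for ch in w:
        node = node["children"][ch]
        node["count"] += 1          # one more input string passes through this node

def count_for(prefix):
    node = root
    for ch in prefix:
        node = node["children"][ch]
    return node["count"]

# Only strings that actually occur in the input list count as prefixes.
ranking = sorted(((count_for(w), w) for w in set(words)), reverse=True)
for count, prefix in ranking:
    print("{0} - {1} times".format(prefix, count))
# in - 4 times, india - 3 times, bull - 3 times, indian - 2 times, ...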
If we know the length of the prefix (say 3):
from nltk import FreqDist

prefixDist = FreqDist()
for word in vocabulary:   # vocabulary is your list of input strings
    prefixDist[word[:3]] += 1
commonPrefix = [prefix for (prefix, count) in prefixDist.most_common(150)]
print(commonPrefix)
import urllib2, sys
from bs4 import BeautifulSoup, NavigableString

obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches off of the above website. Now, I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character buffer object. It's a beginner question, I know, but how can I get around this?
Since you are using BeautifulSoup already, you can directly use obama_4427_div.text instead of str(obama_4427_div) to get the correctly formatted text. The text you get that way would not contain any residual HTML elements, etc.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness, for removing elements from a string, I would create a list of elements to remove (like the remove_char list you have created) and then call str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char,'')
I have extracted a list of sentences from a document. I am pre-processing this list of sentences to make it more sensible. I am faced with the following problem.
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need some pointers towards approaches for this. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
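For example, using the wordsegment package that also shows up further down in this thread (the package name and its load()/segment() functions come from that later answer; the replace() step is my own glue):

from wordsegment import load, segment

broken = "more recen t ly the develop ment, wh ich is a po ten t "
load()                              # load the package's word-frequency data
joined = broken.replace(" ", "")    # concatenate all the string parts
print(segment(joined))
# something like: ['more', 'recently', 'the', 'development', 'which', 'is', 'a', 'potent']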
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since this kind of spurious word break is inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
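A minimal sketch of that greedy idea (my own illustration; valid_words stands in for whatever large dictionary you load, and greedy_rejoin is a made-up name):

def greedy_rejoin(fragments, valid_words):
    # Keep gluing the next fragment on as long as the glued result is still
    # a dictionary word; otherwise emit what we have and start over.
    if not fragments:
        return []
    out = []
    current = fragments[0]
    for frag in fragments[1:]:
        if current + frag in valid_words:
            current += frag
        else:
            out.append(current)
            current = frag
    out.append(current)
    return out

# valid_words = a big set of lowercase words, loaded elsewhere
# greedy_rejoin("more recen t ly the develop ment".split(), valid_words)
# -> ['more', 'recently', 'the', 'development']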
--Solution 1:
Let's think of the chunks in your sentence as beads on an abacus, with each bead consisting of a partial string; the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between its two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves two subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always be before "recen", and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can be easily filtered out.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
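One simple way to do that scoring (word_set here is a placeholder for your lookup dictionary, and score() is just an illustrative helper):

def score(candidate, word_set):
    # Count how many space-separated tokens are known dictionary words,
    # stripping punctuation before the lookup.
    return sum(1 for token in candidate.split()
               if token.strip(",.!?").lower() in word_set)

# best = max(perms, key=lambda p: score(p, word_set))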
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        nres = [prefix_1, prefix_2]
        return nres
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
demo: http://ideone.com/pt4PSt
--Solution#2:
I would suggest an alternate approach which makes use of text analysis intelligence already developed by folks working on similar problems, who have worked on big corpora of data and rely on dictionaries and grammar, e.g. search engines.
I am not well aware of such public/paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep feeding your invalid terms to Google, over multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
Here are two relevant outputs from using two passes of your text; the output of the first pass is used for the second pass, which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One rough technique could be a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break it down into. There are a few things you can do to make it more accurate. To get the first word in text with no spaces, try taking the entire string and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), longest ones first, then taking off letters from the end of the string you want to segment. If you want it to work on a big string, you can make it automatically take off letters from the back so that the string you are looking for the first word in is only as long as the longest dictionary word. This should result in it finding the longest words, and makes it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that uses raw_input to take in the text to correct and a dictionary file called dictionary.txt:
dictionary = open("dictionary.txt", 'r').read().split()  # loads a file with a list of words to break the string up into
words = raw_input("enter text to correct spaces on: ")
words = words.strip()  # strips away spaces
spaced = []  # this is the list of newly broken up words
parsing = True  # this represents when the while loop can end
while parsing:
    if len(words) == 0:  # checks if all of the text has been broken into words; if it has, end the while loop
        parsing = False
        break
    iterating = True
    for iteration in range(45):  # goes through each of the possible word lengths, starting from the biggest (45 letters)
        if iterating == False:
            break
        word = words[:45 - iteration]  # each iteration, the candidate has one more letter removed from the back
        for line in dictionary:
            if line == word:  # this finds if this is the word we are looking for
                spaced.append(word)
                words = words[len(word):]  # takes the matched word off the front of the remaining text
                iterating = False
                break
    if iterating:  # no dictionary word matched; give up to avoid looping forever
        break
print ' '.join(spaced)  # prints the output
If you want it to be even more accurate, you could try using a natural language parsing program, there are several available for python free online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print joined,
        del chunks[:]

# deal with left overs
if chunks:
    print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through a dictionary of words to find the best fit, adding fragments together when a match is not found.
# dictionary should be a collection of valid English words, loaded elsewhere
def iterate(word, dictionary):
    # returns True once the accumulated fragment matches a dictionary word
    return word.strip(',.!?;:') in dictionary

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = []

word = ""
for fragment in sentence.split():
    word += fragment
    if iterate(word, dictionary):
        finished_sentence.append(word)  # a match was found, start a new word
        word = ""

if word:                                # left-over fragments that never matched
    finished_sentence.append(word)

print(" ".join(finished_sentence))
This should work. For the variable dictionary, download a txt file of every single English word, then open it in your program.
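For loading that dictionary, a minimal sketch (assuming a file named dictionary.txt with one word per line):

with open("dictionary.txt") as f:
    dictionary = set(line.strip().lower() for line in f)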
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
My index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There! Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this will work