Create word lists from medical journals - python

I have been asked to compile a crossword for a surgeon's publication, which comes out quarterly. I need to make it medically oriented, preferably using words from different specialties, e.g. some orthopaedics, some cardiac surgery, some human anatomy, etc.
I can get surgical journals online.
I want to create word lists for each specialty and use them in the compiler (I will use Crossword Compiler).
I can use journal articles on the web, or downloaded PDFs. I am a surgeon and use pandas for data analysis, but my Python skills are a bit primitive, so I need relatively simple solutions. How can I create the specific word lists for each surgical specialty?
They don't need to be very specific words, so, for example, I thought I could scrape a journal volume for words, compare them to a list of common words and delete those, leaving me with a technical list. It may require some trial and error. I haven't used Beautiful Soup before but I'm willing to try it.
Alternatively I could drop the Beautiful Soup step and use EndNote to download a few hundred journal articles and export them to txt.
It's the extraction and list-making that I am mainly struggling to conceptualise.
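For the scraping step mentioned above, a minimal sketch using requests and Beautiful Soup might look like this (the URL is just a placeholder, and many journal sites require institutional access or restrict automated scraping):
import requests
from bs4 import BeautifulSoup

url = "https://example.com/journal-article"   # placeholder: an article you can access
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Remove script/style tags so only the readable article text remains
for tag in soup(["script", "style"]):
    tag.decompose()

article_text = soup.get_text(separator=" ")

with open("article.txt", "w", encoding="utf-8") as f:
    f.write(article_text)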

I created this program that you can use to parse through a .txt file to find the most common words. I also included a block of code that will help you to convert a .pdf file to .txt. Hope my approach to the solution helps, good luck with your crossword for the surgeon's publication!
'''
Find the most common words in a txt file
'''
import collections
# The re module provides regular expression matching operations
import re
'''
Use this if you would like to convert a PDF to a txt file
'''
# import PyPDF2
# # Note: this is the older PyPDF2 API (PdfFileReader/getPage); newer releases
# # rename these to PdfReader and reader.pages.
# pdfreader = PyPDF2.PdfFileReader(open('textFileName.pdf', 'rb'))
# # Write the text of every page (not just the last one) to the txt file
# with open('textFileName.txt', 'w') as file1:
#     for page_number in range(pdfreader.numPages):
#         file1.write(pdfreader.getPage(page_number).extractText())
words = re.findall(r'\w+', open('textFileName.txt').read().lower())
most_common = collections.Counter(words).most_common(10)
print(most_common)
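To turn those counts into a specialty word list along the lines described in the question, you could then subtract a list of common English words. A rough sketch, assuming you have downloaded a plain everyday-English word list (one word per line); 'common_words.txt' is a placeholder filename:
import collections
import re

words = re.findall(r'\w+', open('textFileName.txt').read().lower())

# 'common_words.txt' is a placeholder: any common-English word list, one word per line
with open('common_words.txt') as f:
    common = set(line.strip().lower() for line in f)

# Keep words that are not in the common list and are long enough to be useful in a grid
counts = collections.Counter(w for w in words if w not in common and len(w) > 3)

with open('specialty_words.txt', 'w') as out:
    for word, freq in counts.most_common():
        out.write(word + '\n')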

Related

How to do chapter analysis from books imported from nltk.corpus.gutenberg.fileids()

I am a newbie using Python. I am doing natural language processing on a novel, and I chose to load the book from nltk.corpus.gutenberg.fileids(). I am just using 'Sense and Sensibility'. I want to analyze each chapter, so how do I split the whole book into parts? I notice that books loaded this way have an unusual format; it's not like a plain txt file.
import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()
When I print the book out, it shows:
['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', ...]
sense = nltk.Text(nltk.corpus.gutenberg.words('austen-sense.txt'))
print(sense)
Then it shows another format: <Text: Sense and Sensibility by Jane Austen 1811>, and I don't know what that means.
If I use another .txt book source instead, I also don't know how to split the chapters. I've uploaded the book into the folder, then:
text = 'senseText.txt'
but I still don't know how to split it into chapters.
If you want something more like the whole text, try:
raw = nltk.Text(nltk.corpus.gutenberg.raw('austen-sense.txt'))
If you want individual sentences, you can use:
sentences = nltk.Text(nltk.corpus.gutenberg.sents('austen-sense.txt'))
Gutenberg doesn't break up the text by chapters for you. (Many of the original sources didn't have chapters to begin with.) If your specific text happens to include chapter breaks in the raw, you could try searching for those, but it'd be text-specific.
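For example, if the raw text uses headings like "CHAPTER 1" (check the raw string first, since the exact form varies between Gutenberg texts), a rough sketch of the split would be:
import re
import nltk

nltk.download('gutenberg')
raw = nltk.corpus.gutenberg.raw('austen-sense.txt')

# Split on headings like "CHAPTER 1" or "CHAPTER I"; adjust the pattern
# to whatever the raw text actually uses
chapters = re.split(r'CHAPTER\s+\w+', raw)

print(len(chapters) - 1)       # number of chapters (element 0 is the front matter)
print(chapters[1][:200])       # first 200 characters of chapter 1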

Python scraping an unstructured PDF

We get bi-weekly software releases from a supplier who provides us with PDF release notes. The notes have a lot of irrelevant stuff in them, but ultimately we need to manually copy/paste information from these notes into a Confluence page.
Ideally I would like to be able to write a Python app to scrape certain sections out of the PDF. The structure is pretty much as follows (the parts I want to extract are the section 2 feature headings and the section 3 defect table):
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X
description
Defect fixes
description
table with defect descriptions
rest of the document is irrelevant in this case
I have managed to get it to import the file and extract all of the text, but I have really got no idea how to extract only the headings for section 2, and then for section 3 only take the table and reformat it with pandas. Any suggestions on how to go about this?
import fitz
filename = r'~\releasenotes.pdf'
doc = fitz.open(filename)
print(doc)  # Just to see what comes out
(and now what should I do next?)
A simple regex (regular expression) should do the trick here. I'm making some big assumptions around what the text looks like when it comes out of your pdf read - I have copied the text from your post and called it "doc" per your question :)
import re  # regular expression library
import pandas as pd  # needed for pd.Series below
doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
'''
ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', doc))
Let me unpack that last line:
re.findall will produce a list of every match of the search pattern in your document.
r'2\.[1-9].*\n' will find all instances of 2. (the dot is escaped so it matches a literal full stop) followed by any digit from [1-9], followed by any number of characters .* until it reaches a line break \n.
Hope this fits the bill?
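To go from the fitz document to a string you can run that regex over, something like this should work (page.get_text() is the call in recent PyMuPDF versions; older releases spell it getText()):
import re
import fitz        # PyMuPDF
import pandas as pd

doc = fitz.open('releasenotes.pdf')   # path as in the question

# Concatenate the plain text of every page into one string
full_text = ""
for page in doc:
    full_text += page.get_text()

# Pull out the "2.x ..." feature headings, as above
ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', full_text))
print(ds_features)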

How to restore a split word by removing the hyphen "-" left by hyphenation in a paragraph, using Python

simple example: func-tional --> functional
The story is that I have a Microsoft Word document converted from PDF format, and some words remain hyphenated (such as func-tional, broken by a line break in the PDF). I want to recover those broken words while keeping normal ones (i.e., where the "-" is not a line-break hyphen).
In order to make it more clear, one long example (source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script searches for hyphenated words and replaces the hyphen with nothing.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find combinations of word-word
    sOut = s
    for m in matchList:
        new = m.replace("-", "")
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to RegExp I recommend this site:
https://regex101.com/
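If you also need to keep legitimate hyphens (the original requirement that normal hyphenated words stay intact), one rough refinement is to drop the hyphen only when the joined form is a known word. A sketch using NLTK's English word list (any dictionary or word list would do):
import re
import nltk

nltk.download('words')
vocabulary = set(w.lower() for w in nltk.corpus.words.words())

def replace_hyphenated(s):
    def fix(match):
        joined = match.group(0).replace("-", "")
        # only remove the hyphen if the joined form is a real word
        return joined if joined.lower() in vocabulary else match.group(0)
    return re.sub(r"\w+-\w+", fix, s)

print(replace_hyphenated("a Func-tional check flight and a well-known checklist"))
# "Func-tional" is joined to "Functional"; "well-known" should stay as-is
# (assuming "wellknown" is not in the word list)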

Sentence clustering

I have a huge number of names from different sources.
I need to extract all the groups (parts of the names) that repeat from one name to another.
In the example below the program should locate: Post, Office, Post Office.
I need to get a popularity count.
So I want to extract a list of phrases sorted by popularity.
Here is an example of names:
Post Office - High Littleton
Post Office Pilton Outreach Services
Town Street Post Office
post office St Thomas
Basically I need to find an algorithm, or better a library, to get results like these:
Post Office: 16999
Post: 17934
Office: 16999
Tesco: 7300
...
Here is the full example of names.
I wrote some code which is fine for single words, but not for phrases:
from textblob import TextBlob
import operator

title_file = open("names.txt", 'r')
blob = TextBlob(title_file.read())
word_list = sorted(blob.word_counts.items(), key=operator.itemgetter(1))
print(word_list)
You are not looking for clustering (and that is probably why "all of them suck" for #andrewmatte).
What you are looking for is word counting (or more precisely, n-gram counting), which is actually a much easier problem. That is why you won't find a dedicated library for it...
Well, actually you have some libraries. In Python, for example, the collections module has the Counter class, which provides much of the reusable code.
An untested, very basic sketch:
from collections import Counter

counter = Counter()
for s in sentences:
    words = s.split(" ")
    for i in range(len(words)):
        counter[words[i]] += 1                       # count single words
        if i > 0:
            counter[(words[i - 1], words[i])] += 1   # count adjacent word pairs
You can get the most frequent entries from counter. If you want words and word pairs counted separately, feel free to use two counters, as in the sketch below. If you need longer phrases, add an inner loop. You may also want to clean the sentences (e.g. lowercase them) and use a regexp for splitting.
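For instance, a concrete version of that (two counters, lowercased words read straight from names.txt) might look like this, with most_common() giving the popularity-sorted lists:
from collections import Counter

word_counts = Counter()
pair_counts = Counter()

with open('names.txt') as f:
    for line in f:
        words = line.lower().split()
        word_counts.update(words)                   # single words
        pair_counts.update(zip(words, words[1:]))   # adjacent pairs, e.g. ('post', 'office')

print(word_counts.most_common(10))
print(pair_counts.most_common(10))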
Are you looking for something like this?
workspace = {}
with open('names.txt', 'r') as f:
    for name in f:
        name = name.strip()
        if len(name):  # makes sure the line isn't empty
            if name in workspace:
                workspace[name] += 1
            else:
                workspace[name] = 1
for name in workspace:
    print("{}: {}".format(name, workspace[name]))

Create an index of the content of each file in a folder

I'm making a search tool in Python.
Its objective is to be able to search files by their content (we're mostly talking about source files and text files, not images/binaries, even if searching their metadata would be a great improvement). For now I don't use regular expressions, just plain text.
This part of the algorithm works great!
The problem is that I realize I'm mostly searching in the same few folders, so I'd like to find a way to build an index of the content of each file in a folder, and to be able to know as fast as possible whether the sentence I'm searching for is in xxx.txt or can't be there.
The idea for now is to maintain a checksum for each file that lets me know whether it contains a particular string.
Do you know of any algorithm close to this?
I don't need a 100% success rate; I'd rather have a small index than a big one with 100% success.
The idea is to provide a generic tool.
EDIT: To be clear, I want to search for PART of the content of the file, so making an md5 hash of all its content and comparing it with the hash of what I'm searching for isn't a good idea ;)
Here I am using the Whoosh library for searching/indexing. The upper part indexes the files and the lower part is a demo search.
# indexing part
from whoosh.index import create_in
from whoosh.fields import *
import os
import stat
import time

schema = Schema(FileName=TEXT(stored=True), FilePath=TEXT(stored=True), Size=TEXT(stored=True),
                LastModified=TEXT(stored=True), LastAccessed=TEXT(stored=True),
                CreationTime=TEXT(stored=True), Mode=TEXT(stored=True))

# the index directory must exist before create_in is called
if not os.path.exists("./my_whoosh_index_dir"):
    os.mkdir("./my_whoosh_index_dir")
ix = create_in("./my_whoosh_index_dir", schema)
writer = ix.writer()

for top, dirs, files in os.walk('./my_test_dir'):
    for nm in files:
        fileStats = os.stat(os.path.join(top, nm))
        fileInfo = {
            'FileName': nm,
            'FilePath': os.path.join(top, nm),
            'Size': fileStats[stat.ST_SIZE],
            'LastModified': time.ctime(fileStats[stat.ST_MTIME]),
            'LastAccessed': time.ctime(fileStats[stat.ST_ATIME]),
            'CreationTime': time.ctime(fileStats[stat.ST_CTIME]),
            'Mode': fileStats[stat.ST_MODE]
        }
        writer.add_document(FileName=u'%s' % fileInfo['FileName'],
                            FilePath=u'%s' % fileInfo['FilePath'],
                            Size=u'%s' % fileInfo['Size'],
                            LastModified=u'%s' % fileInfo['LastModified'],
                            LastAccessed=u'%s' % fileInfo['LastAccessed'],
                            CreationTime=u'%s' % fileInfo['CreationTime'],
                            Mode=u'%s' % fileInfo['Mode'])
writer.commit()

## now the searching part
from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("FileName", ix.schema).parse(u"hsbc")  # here 'hsbc' is the search term
    results = searcher.search(query)
    for x in results:
        print(x['FileName'])
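Note that this schema only indexes file names and metadata, while the question is about searching file contents. A minimal sketch of the same Whoosh approach with a content field (plain-text files assumed, undecodable bytes ignored; the directory and search term are placeholders) might be:
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

# Index the contents of plain-text files so searches hit the text itself
schema = Schema(FilePath=TEXT(stored=True), Content=TEXT)
if not os.path.exists("content_index"):
    os.mkdir("content_index")
ix = create_in("content_index", schema)

writer = ix.writer()
for top, dirs, files in os.walk('./my_test_dir'):
    for nm in files:
        path = os.path.join(top, nm)
        with open(path, errors='ignore') as fh:
            writer.add_document(FilePath=path, Content=fh.read())
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("Content", ix.schema).parse(u"hsbc")   # 'hsbc' is just the demo term
    for hit in searcher.search(query):
        print(hit['FilePath'])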
It's not the most efficient, but it just uses the stdlib and a little bit of work. sqlite3 (if it's enabled at compile time) supports full-text indexing. See: http://www.sqlite.org/fts3.html
So you could create a table of [file_id, filename] and a table of [file_id, line_number, line_text], and base your queries on those, i.e.: which files contain this word, which lines contain this AND that but not the other, etc. A rough sketch follows.
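This sketch assumes your Python's sqlite3 was built with FTS enabled (most recent builds are); the directory and search term are placeholders:
import os
import sqlite3

conn = sqlite3.connect("index.db")
# One row per line of each file, indexed for full-text search
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS lines USING fts4(filename, line_number, line_text)")

for top, dirs, files in os.walk("./my_test_dir"):
    for nm in files:
        path = os.path.join(top, nm)
        with open(path, errors="ignore") as fh:
            for i, line in enumerate(fh, start=1):
                conn.execute("INSERT INTO lines VALUES (?, ?, ?)", (path, i, line))
conn.commit()

# Which files and lines mention a given word?
for filename, line_number in conn.execute(
        "SELECT filename, line_number FROM lines WHERE line_text MATCH ?", ("hsbc",)):
    print(filename, line_number)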
The only reason anyone would want a tool that is capable of searching 'certain parts' of a file is because what they are trying to do is analyze data that has legal restrictions on which parts of it you can read.
For example, Apple has the capability of identifying the GPS location of your iPhone at any moment a text was sent or received. But, what they cannot legally do is associate that location data with anything that can be tied to you as an individual.
On a broad scale you can use obscure data like this to track and analyze patterns throughout large amounts of data. You could feasibly assign a unique 'Virtual ID' to every cell phone in the USA and log all location movement; afterward you implement a method for detecting patterns of travel. Outliers could be detected through deviations in their normal travel pattern. That 'metadata' could then be combined with data from outside sources such as names and locations of retail locations. Think of all the situations you might be able to algorithmically detect. Like the soccer dad who for 3 years has driven the same general route between work, home, restaurants, and a little league field. Only being able to search part of a file still offers enough data to detect that Soccer Dad's phone's unique signature suddenly departed from the normal routine and entered a gun shop. The possibilities are limitless. That data could be shared with local law enforcement to increase street presence in public spaces nearby; all while maintaining anonymity of the phone's owner.
Capabilities like the example above are not legally possible in today's environment without the method IggY is looking for.
On the other hand, it could just be that he is only looking for certain types of data in certain file types. If he knows where in the file he wants to search for the data he needs he can save major CPU time only reading the last half or first half of a file.
You can do a simple name-based cache as below. This is probably best (fastest) if the file contents are not expected to change. Otherwise, you can MD5 the file contents. I say MD5 because it's faster than SHA, and this application doesn't seem security sensitive.
from hashlib import md5
import os

info_cache = {}
for file in files_to_search:
    file_info = get_file_info(file)
    # hash the (absolute) file name, not the contents; md5 needs bytes, hence .encode()
    file_hash = md5(os.path.abspath(file).encode()).hexdigest()
    info_cache[file_hash] = file_info
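If the contents can change, the same idea with a hash of the file's bytes rather than its path would look roughly like this (reading in chunks so large files aren't loaded into memory at once; files_to_search and get_file_info are the same placeholders as in the snippet above):
from hashlib import md5

def file_content_hash(path, chunk_size=1 << 20):
    # Hash the file's bytes in 1 MB chunks
    h = md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Same cache as above, keyed by content instead of name
info_cache = {}
for file in files_to_search:
    info_cache[file_content_hash(file)] = get_file_info(file)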
