how to extract the information I want using NLTK - python

I want to extract relevant information about a few topics, for example:
product information
purchase experience of customer
recommendation of family or friend
As a first step, I extract information from one of the websites, for instance:
""
i think AIA does a more better life insurance as my comparison and
the companies comparisonand most important is also medical insurance
in my opinionyes there are some agents that will sell u plans that
their commission is high...dun worry u buy insurance from a company
anything happens u can contact back the company also can ...better
find a agent that is reliable and not just working for the commission
for now , they might not service u in the future...thanksregardsdiana
""
Then, using NLTK in VS2015, I tried to split the text into words:
toks = nltk.word_tokenize(text)
Using pos_tag I can tag my toks:
postoks = nltk.tag.pos_tag(toks)
From this point on I am not sure what I should do.
Previously, I used IBM Text Analytics. In that software I used to create a dictionary, then define some patterns, and then analyse the data. For instance:
Sample of Dictionary: insurance_cmp : {AIA, IMG, SABB}
Sample of pattern:
insurance_cmp + Good_Feeling_Pattern
insurance_cmp + ['purchase|Buy'] + Bad_Feeling_Pattern
Good_Feeling_Pattern = [good, like it, nice]
Bad_Feeling_Pattern = [bad, worse, not good, regret]
I would like to know whether I can simulate the same thing in NLTK. Can a chunker and a hand-written grammar help me extract what I am looking for? May I have your ideas on how to improve, please?
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)
Please help me figure out what my next step should be to reach my goal.

You just need to follow this video or read this blog.
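For instance, here is a rough sketch of the dictionary-plus-pattern idea in plain NLTK. The company and feeling word lists are taken from the question; the window-based matching logic is only an assumption to illustrate the approach, not a full replacement for IBM Text Analytics patterns:
import nltk

# Dictionary and pattern word lists adapted from the question.
insurance_cmp = {"AIA", "IMG", "SABB"}
good_feeling_pattern = {"good", "like", "nice", "better"}
bad_feeling_pattern = {"bad", "worse", "regret"}

def find_patterns(text, window=6):
    # Report each company mention that is followed by a feeling word
    # within a small window of tokens (a simple, assumed heuristic).
    toks = nltk.word_tokenize(text)
    hits = []
    for i, tok in enumerate(toks):
        if tok.upper() in insurance_cmp:
            context = [t.lower() for t in toks[i + 1:i + 1 + window]]
            if any(w in good_feeling_pattern for w in context):
                hits.append((tok, "good_feeling"))
            elif any(w in bad_feeling_pattern for w in context):
                hits.append((tok, "bad_feeling"))
    return hits

print(find_patterns("i think AIA does a more better life insurance as my comparison"))
The chunker output can be used in the same spirit: walk the NP subtrees and test whether a company name and a feeling word occur together.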


Create word lists from medical journals

I have been asked to compile a crossword for a surgeon's publication, which comes out quarterly. I need to make it medically oriented, preferably using words from different specialties, e.g. some from orthopaedics, some from cardiac surgery, some from human anatomy, etc.
I can get surgical journals online.
I want to create word lists for each specialty and use them in the compiler. I will use Crossword Compiler.
I can use journal articles on the web, or downloaded PDFs. I am a surgeon and use pandas for data analysis, but my Python skills are a bit primitive, so I need relatively simple solutions. How can I create the specific word lists for each surgical specialty?
They don't need to be very specific words, so, for example, I thought I could scrape the journal volume for words, compare them to a list of common words and delete those, leaving me with a technical list. It may require some trial and error. I haven't used Beautiful Soup before but I am willing to try it.
Alternatively, I could skip the Beautiful Soup step and use EndNote to download a few hundred journal articles and export them to txt.
It's the extraction and list-making that I am mainly struggling to conceptualise.
I created this program that you can use to parse a .txt file and find the most common words. I also included a block of code that will help you convert a .pdf file to .txt. I hope my approach to the solution helps; good luck with your crossword for the surgeon's publication!
'''
Find the most common words in a txt file
'''
import collections
# The re module provides regular expression matching operations
import re
'''
Use this if you would like to convert a PDF to a txt file
'''
# import PyPDF2
# pdffileobj = open('textFileName.pdf', 'rb')
# pdfreader = PyPDF2.PdfFileReader(pdffileobj)
# file1 = open(r"(folder path)\textFileName.txt", "a")
# # loop over every page so the whole document is written, not just the last page
# for page_num in range(pdfreader.numPages):
#     pageobj = pdfreader.getPage(page_num)
#     file1.writelines(pageobj.extractText())
# file1.close()
words = re.findall(r'\w+', open('textFileName.txt').read().lower())
most_common = collections.Counter(words).most_common(10)
print(most_common)
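To get from word counts to the specialty word lists described in the question, a possible next step is to discard common everyday English words and keep the rest as candidate technical terms. Here is a minimal sketch; journal.txt and common_words.txt are assumed file names (one common word per line in the latter):
import collections
import re

# Words from the journal text (journal.txt is an assumed file name)
journal_words = re.findall(r'[a-z]+', open('journal.txt').read().lower())

# Everyday words to discard (common_words.txt is an assumed file name)
with open('common_words.txt') as f:
    common_words = set(f.read().split())

# Keep only words that are not common English, i.e. likely technical terms
technical = [w for w in journal_words if w not in common_words and len(w) > 3]

# The most frequent candidates for the specialty word list
print(collections.Counter(technical).most_common(25))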

how to remove names of people in corpus using python

I've been searching for this for a long time and most of the materials I've found were about named entity recognition. I'm running topic modeling, but in my data there are too many people's names in the texts.
Is there any Python library that contains (English) names of people? Or, if not, what would be a good way to remove names of people from each document in the corpus?
Here's a simple example:
texts=['Melissa\'s home was clean and spacious. I would love to visit again soon.','Kevin was nice and Kevin\'s home had a huge parking spaces.']
I would suggest using a tokenizer with some capability to recognize and differentiate proper nouns. spacy is quite versatile and its default tokenizer does a decent job of this.
There are hazards to using a list of names as if they're stop words - let me illustrate:
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
"Kevin was nice and Kevin's home had a huge parking spaces."
"Bill sold a work of art to Art and gave him a bill"]
tokenList = []
for i, sentence in enumerate(texts):
doc = nlp(sentence)
for token in doc:
tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])
tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")
So the first two sentences are easy, and spacy identifies the proper nouns "PROPN":
Now, the third sentence has been constructed to show the issue: lots of people have names that are also things. spacy's default tokenizer isn't perfect, but it does a respectable job with the two sides of the task: don't remove names when they are being used as regular words (e.g. bill of goods, work of art), and do identify them when they are being used as names. (You can see that it messed up one of the references to Art, the person.)
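As a possible follow-up (an assumption on my part, not something the tokenizer does for you), you could then drop the tokens spaCy labels as person entities and keep everything else:
# Drop tokens spaCy recognises as PERSON entities; everything else is kept as-is.
cleaned = []
for sentence in texts:
    doc = nlp(sentence)
    kept = [token.text_with_ws for token in doc if token.ent_type_ != "PERSON"]
    cleaned.append("".join(kept))
print(cleaned)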
I'm not sure if this solution is efficient and robust, but it's simple to understand (to me, at the very least):
import re
# get a list of existing names (over 18,000) from the file
with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())
# your list of texts
texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces."]
# join the texts into one string
texts = ' | '.join(texts)
# find all the words that look like names
pattern = r"(\b[A-Z][a-z]+('s)?\b)"
found_names = re.findall(pattern, texts)
# strip the possessive endings and remove duplicates
found_names = set([name[0].replace("'s", "") for name in found_names])
# keep only the words that are actually included in NAMES
found_names = [name for name in found_names if name in NAMES]
# loop through the found names and remove every one of them from the texts
for name in found_names:
    texts = re.sub(name + "('s)?", "", texts)  # also removes possessive forms
# split the texts back into a list
texts = texts.split(' | ')
print(texts)
Output:
[' home was clean and spacious. I would love to visit again soon.',
' was nice and home had a huge parking spaces.']
The list of names was obtained here: https://www.usna.edu/Users/cs/roche/courses/s15si335/proj1/files.php%3Ff=names.txt.html
And I completely endorse #James_SO's recommendation to use smarter tools.

How to find the keyword in a text considering the context?

I have a list of keywords that are stored in a json file called vocations.json and a database that contains more than 50000 records.
Many records have Wikipedia links. By connecting to Wikipedia, I search all the keywords for each record and try to find whether the keywords appear in the first paragraph of the record's biography.
The code below finds the keywords; however, I need a cleverer algorithm so that the program evaluates each keyword against the context of the text.
import re
import json
import requests
from bs4 import BeautifulSoup as BS

def get_text(url):
    r = requests.get(url, timeout=5)
    div = BS(r.content, "html.parser").select_one(".mw-content-ltr")
    p = BS(str(div), "html.parser").find_all("p")
    try:
        return [i.text for i in p if i.text != "\n"][0]
    except IndexError:
        return

def find_occupations(url, keywords):
    text = get_text(url=url)
    if not text:
        return url, None
    occupations = []
    for keyword in keywords:
        for i in re.findall(rf"\s{keyword.lower()}", text.lower()):
            if keyword not in occupations:
                occupations.append(keyword)
    return url, occupations

with open("vocations.json") as f:
    words = json.load(f)
For some records, the above code finds the keywords correctly. Below you can see an example of a correct match:
url1 = "https://en.wikipedia.org/wiki/Gerolamo_Cardano"
print(find_occupations(url1, words))
The first paragraph of the above url is below:
Gerolamo (also Girolamo[3] or Geronimo[4]) Cardano (Italian: [dʒeˈrɔlamo karˈdano]; French: Jérôme Cardan; Latin: Hieronymus Cardanus; 24 September 1501 – 21 September 1576) was an Italian polymath, whose interests and proficiencies ranged from being a mathematician, physician, biologist, physicist, chemist, astrologer, astronomer, philosopher, writer, and gambler.[5] He was one of the most influential mathematicians of the Renaissance, and was one of the key figures in the foundation of probability and the earliest introducer of the binomial coefficients and the binomial theorem in the Western world. He wrote more than 200 works on science.[6]
The output that I am getting is below:
('https://en.wikipedia.org/wiki/Gerolamo_Cardano', ['Astrologer', 'Astronomer', 'Biologist', 'Chemist', 'Gambler', 'Mathematician', 'Philosopher', 'Physician', 'Physicist', 'Polymath', 'Writer'])
But for some records, such as the one below, I get wrong results.
url2 = "http://en.wikipedia.org/wiki/Barbara_Villiers"
print(find_occupations(url2, words))
The first paragraph of the above url is below:
Barbara Palmer, 1st Duchess of Cleveland (27 November [O.S. 17 November] 1640[1] – 9 October 1709), more often known by her maiden name Barbara Villiers or her title of Countess of Castlemaine, was an English royal mistress of the Villiers family and perhaps the most notorious of the many mistresses of King Charles II of England, by whom she had five children, all of them acknowledged and subsequently ennobled. Barbara was the subject of many portraits, in particular by court painter Sir Peter Lely. In the Gilded Age, it was stylish to adorn an estate with her likeness.
Below is the output I get, which is not entirely correct.
('http://en.wikipedia.org/wiki/Barbara_Villiers', ['King', 'Mistress', 'Painter'])
I know why the program finds the keywords "King" and "Painter" even though they are not attributes of Barbara Villiers: these keywords are also stored in the json file, and they also appear in the first paragraph of the Wikipedia page.
My first question is: is there a way to find the keywords correctly by evaluating the context of the text? If so, what are your suggestions?
My second question is: if we can search for and find the word using a method that evaluates the searched word against the context of the text, would it still be necessary to review all 50,000 records to see whether the algorithm produced accurate results?
Edit: Below are some items of the vocations.json file.
[
"Accessory designer",
"Acoustical engineer",
"Acrobat",
"Actor",
"Actress",
"Advertising designer",
"Aeronautical engineer",
"Aerospace engineer",
"Agricultural engineer",
"Anesthesiologist",
"Anesthesiologist Assistant",
"Animator",
"Anthropologist",
"Applied engineer",
"Arborist",
"Archaeologist",
"Archimime",
"Architect",
"Army officer",
"Art administrator",
"Artisan",
[...]
]
Question 1: Is there a way to find the keywords correctly via evaluating the context of the text? If so, what are your suggestions?
Keyword detection (also known as keyword extraction) falls under natural language processing (NLP).
Some of the techniques for keyword extraction include:
word collocations and co-occurrences
TF-IDF (short for term frequency–inverse document frequency)
RAKE (Rapid Automatic Keyword Extraction)
Support Vector Machines (SVM)
deep learning
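To give one hedged illustration of evaluating a keyword against its context: with spaCy you can require that a vocation word be grammatically attached to the biography's subject (e.g. as a predicate attribute or an appositive) rather than merely present somewhere in the paragraph. The dependency labels used below are an assumption and will not cover every phrasing:
import spacy

nlp = spacy.load("en_core_web_sm")

def contextual_occupations(first_paragraph, keywords):
    # Keep only vocation keywords that complement or modify the main verb,
    # rather than any keyword that merely appears in the paragraph.
    doc = nlp(first_paragraph)
    keywords_lower = {k.lower() for k in keywords}
    found = set()
    for token in doc:
        # 'attr', 'appos' and 'conj' typically mark constructions such as
        # "X was a mathematician, physician, ..." describing the subject.
        if token.dep_ in ("attr", "appos", "conj") and token.lemma_.lower() in keywords_lower:
            found.add(token.lemma_.lower())
    return sorted(found)

text = ("Gerolamo Cardano was an Italian polymath, whose interests ranged from "
        "being a mathematician, physician, biologist, physicist, chemist, "
        "astrologer, astronomer, philosopher, writer, and gambler.")
print(contextual_occupations(text, ["Mathematician", "Physician", "King", "Painter"]))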
Question 2: If we can search and find the word using a method that can evaluate the searched word with the context of the text, would it ultimately be necessary to review all 50,000 records to see if the algorithm produced an accurate result?
Developing a statistical model may not require training data, whereas building a deep learning model may require a significant amount of data. So it all depends on which approach is used.

how to restore a word split by a hyphen "-" because of hyphenation in a paragraph using python

simple example: func-tional --> functional
The story is that I have a Microsoft Word document that was converted from PDF format, and some words remain hyphenated (such as func-tional, broken because of a line break in the PDF). I want to recover those broken words while keeping normal ones (i.e., where "-" is not a line-break hyphen).
To make it clearer, here is a longer example (source text):
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script searches for hyphenated words and replaces the hyphen with nothing.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find combinations of word-word
    sOut = s
    for m in matchList:
        new = m.replace("-", "")
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
The output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to regular expressions, I recommend this site:
https://regex101.com/
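One caveat: the script above removes every in-word hyphen, including legitimate ones such as "drug-eluting". A possible refinement (an assumption on my part, using NLTK's word list rather than anything from the original answer) is to join a hyphenated pair only when the joined form is a known English word:
import re
from nltk.corpus import words  # may first require nltk.download('words')

ENGLISH = set(w.lower() for w in words.words())

def replaceHyphenatedChecked(s):
    # Remove the hyphen only when the joined word exists in the word list.
    def join_if_word(match):
        joined = match.group(0).replace("-", "")
        return joined if joined.lower() in ENGLISH else match.group(0)
    return re.sub(r"\w+-\w+", join_if_word, s)

print(replaceHyphenatedChecked("The Func-tional Check Flight of the drug-eluting stent."))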

Search methods and string matching in python

I have a task to search for a group of specific terms (around 138,000 terms) in a table made of 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column might contain more than one term.
I should end up with a CSV table containing the id where a term has been found and the term itself. What could be the best and fastest way to do this?
In my script, I tried creating phrases by iterating over the different words of a term, in order, and comparing each one with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(id, [])
            if phrase not in final_dict[id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, make two changes. After:
for line in text:
add (indented four spaces):
    line = line.lower()
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT, with these changes the otherwise unmodified code from the website works even when I change the case of some of the data in tumorabs.txt and neocl.xml.
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with a little bit of knowledge of Python!
If you need more code, I can post it online.
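For illustration, here is a self-contained, hedged sketch of the same idea (lower-casing plus phrase building) applied to the example title above; the three-term set is made up for the demonstration, while in the real task the terms would come from the 138,000-term list:
# Hypothetical example data; the real terms come from the coded dictionary.
terms = {"heart disease", "diabetes mellitus", "renal impairment"}
title = ("A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on "
         "Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients "
         "With Diabetes Mellitus or Renal Impairment")

found = set()
tokens = title.lower().split()
# Try every starting position and every phrase length up to the longest term.
max_len = max(len(t.split()) for t in terms)
for start in range(len(tokens)):
    for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
        phrase = " ".join(tokens[start:end])
        if phrase in terms:
            found.add(phrase)
print(found)  # all three terms are found despite the mixed case in the title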
