I have a list of keywords stored in a JSON file called vocations.json, and a database that contains more than 50,000 records.
Many of the records have Wikipedia links. By connecting to Wikipedia, I search for each keyword in every record and try to find whether the keywords appear in the first paragraph of the record's biography.
The code below finds the keywords, but I need a smarter algorithm in which the program evaluates each keyword in the context of the text.
import re
import json
import requests
from bs4 import BeautifulSoup as BS
def get_text(url):
    r = requests.get(url, timeout=5)
    div = BS(r.content, "html.parser").select_one(".mw-content-ltr")
    p = BS(str(div), "html.parser").find_all("p")
    try:
        return [i.text for i in p if i.text != "\n"][0]
    except IndexError:
        return

def find_occupations(url, keywords):
    text = get_text(url=url)
    if not text:
        return url, None
    occupations = []
    for keyword in keywords:
        for i in re.findall(rf"\s{keyword.lower()}", text.lower()):
            if keyword not in occupations:
                occupations.append(keyword)
    return url, occupations

with open("vocations.json") as f:
    words = json.load(f)
For some records, the code above finds the keywords correctly. Below is an example of a correct match:
url1 = "https://en.wikipedia.org/wiki/Gerolamo_Cardano"
print(find_occupations(url1, words))
The first paragraph of the above URL is below:
Gerolamo (also Girolamo[3] or Geronimo[4]) Cardano (Italian: [dʒeˈrɔlamo karˈdano]; French: Jérôme Cardan; Latin: Hieronymus Cardanus; 24 September 1501 – 21 September 1576) was an Italian polymath, whose interests and proficiencies ranged from being a mathematician, physician, biologist, physicist, chemist, astrologer, astronomer, philosopher, writer, and gambler.[5] He was one of the most influential mathematicians of the Renaissance, and was one of the key figures in the foundation of probability and the earliest introducer of the binomial coefficients and the binomial theorem in the Western world. He wrote more than 200 works on science.[6]
The output that I am getting is below:
('https://en.wikipedia.org/wiki/Gerolamo_Cardano', ['Astrologer', 'Astronomer', 'Biologist', 'Chemist', 'Gambler', 'Mathematician', 'Philosopher', 'Physician', 'Physicist', 'Polymath', 'Writer'])
But for some records, such as the one below, I am getting wrong results.
url2 = "http://en.wikipedia.org/wiki/Barbara_Villiers"
print(find_occupations(url2, words))
The first paragraph of the above URL is below:
Barbara Palmer, 1st Duchess of Cleveland (27 November [O.S. 17 November] 1640[1] – 9 October 1709), more often known by her maiden name Barbara Villiers or her title of Countess of Castlemaine, was an English royal mistress of the Villiers family and perhaps the most notorious of the many mistresses of King Charles II of England, by whom she had five children, all of them acknowledged and subsequently ennobled. Barbara was the subject of many portraits, in particular by court painter Sir Peter Lely. In the Gilded Age, it was stylish to adorn an estate with her likeness.
Below is the output I am getting, which is not entirely correct.
('http://en.wikipedia.org/wiki/Barbara_Villiers', ['King', 'Mistress', 'Painter'])
I know why the program is finding the keywords "King" and "Painter" even though they are not attributes of Barbara Villiers: these keywords are also stored in the JSON file, and they also appear in the first paragraph of the Wikipedia page.
My first question is: is there a way to find the keywords correctly by evaluating the context of the text? If so, what are your suggestions?
My second question is: if we can search for and find a keyword with a method that evaluates the word in the context of the text, would it still be necessary to review all 50,000 records to check whether the algorithm produced accurate results?
Edit: Below are some items from the vocations.json file.
[
"Accessory designer",
"Acoustical engineer",
"Acrobat",
"Actor",
"Actress",
"Advertising designer",
"Aeronautical engineer",
"Aerospace engineer",
"Agricultural engineer",
"Anesthesiologist",
"Anesthesiologist Assistant",
"Animator",
"Anthropologist",
"Applied engineer",
"Arborist",
"Archaeologist",
"Archimime",
"Architect",
"Army officer",
"Art administrator",
"Artisan",
[...]
]
Question 1: Is there a way to find the keywords correctly by evaluating the context of the text? If so, what are your suggestions?
Keyword detection (also known as keyword extraction) falls under natural language processing (NLP).
Some of the techniques for keyword extraction include the following (a small context-filtering sketch follows the list):
word collocations and co-occurrences
TF-IDF (short for term frequency–inverse document frequency)
RAKE (Rapid Automatic Keyword Extraction)
Support Vector Machines (SVM)
deep learning
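As a minimal sketch of the context idea (an illustration under my own assumptions, not a drop-in replacement for the code in the question): a dependency parse from spaCy can keep a vocation only when it is grammatically attached to the sentence, e.g. as a predicate noun or a conjunct of one, instead of merely occurring somewhere in the paragraph. The words list and get_text are assumed to be the ones from the question, and only single-word vocations are handled here.
import spacy

nlp = spacy.load("en_core_web_sm")

def contextual_occupations(text, keywords):
    # Keep a keyword only when the matching token is linked to the sentence as a
    # predicate noun ("was an Italian polymath"), an apposition, or a conjunct of
    # such a token; compounds like "court painter Sir Peter Lely" or
    # "King Charles II" are skipped. Single-word vocations only in this sketch.
    vocset = {k.lower() for k in keywords}
    found = set()
    for token in nlp(text):
        if token.lemma_.lower() in vocset and token.dep_ in ("attr", "appos", "conj"):
            found.add(token.lemma_.capitalize())
    return sorted(found)

# usage with the question's helpers (assumed):
# print(contextual_occupations(get_text(url1), words))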
Question 2: If we can search for and find the word using a method that evaluates the searched word in the context of the text, would it still be necessary to review all 50,000 records to see if the algorithm produced accurate results?
Developing a statistical model may not require training data, whereas building a deep learning model may require a significant amount of data. So it all depends on which approach is used.
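On the reviewing question, one common shortcut (my suggestion, not something established in this answer) is to hand-check a random sample of records and estimate accuracy from it rather than reviewing all 50,000. A tiny sketch, assuming record_urls is the list of Wikipedia links from the database:
import random

def sample_for_review(record_urls, sample_size=200, seed=42):
    # A reproducible random sample; hand-labelling roughly 200 records gives a
    # rough estimate of precision without reviewing the full 50,000.
    random.seed(seed)
    return random.sample(record_urls, min(sample_size, len(record_urls)))

# estimated precision = correct matches / total matches on the sampled records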
Related
I have been asked to compile a crossword for a surgeon's publication, which comes out quarterly. I need to make it medically oriented, preferably using words from different specialties, e.g. some from orthopaedics, some from cardiac surgery, and some from human anatomy.
I can get surgical journals online.
I want to create word lists for each specialty and use them in the compiler. I will use Crossword Compiler.
I can use journal articles on the web or downloaded PDFs. I am a surgeon and use pandas for data analysis, but my Python skills are a bit primitive, so I need relatively simple solutions. How can I create the specific word lists for each surgical specialty?
They don't need to be very specific words, so I thought I could scrape the journal volume for words, compare them to a list of common words, and delete those, leaving me with a technical list. It may require some trial and error. I haven't used Beautiful Soup before but am willing to try it.
Alternatively, I could skip the Beautiful Soup step and use EndNote to download a few hundred journal articles and export them to txt.
It's the extraction and list-making that I am mainly struggling to conceptualise.
I created this program that you can use to parse a .txt file and find the most common words. I also included a block of code that will help you convert a .pdf file to .txt. I hope my approach to the solution helps; good luck with your crossword for the surgeon's publication!
'''
Find the most common words in a txt file
'''
import collections
# The re module provides regular expression matching operations
import re
'''
Use this if you would like to convert a PDF to a txt file
'''
# import PyPDF2
# pdffileobj=open('textFileName.pdf','rb')
# pdfreader=PyPDF2.PdfFileReader(pdffileobj)
# x=pdfreader.numPages
# pageobj=pdfreader.getPage(x-1)
# text=pageobj.extractText()
# file1=open(r"(folder path)\\textFileName.txt","a")
# file1.writelines(text)
# file1.close()
words = re.findall(r'\w+', open('textFileName.txt').read().lower())
most_common = collections.Counter(words).most_common(10)
print(most_common)
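As a rough sketch of the "compare to a list of common words and delete those" step described in the question (the file names here, textFileName.txt and common_words.txt, are assumptions, the latter being a plain list of everyday English words, one per line):
import collections
import re

# words from the journal text scraped or exported earlier
journal_words = re.findall(r'\w+', open('textFileName.txt').read().lower())
with open('common_words.txt') as f:
    common = {w.strip().lower() for w in f}

# keep words that are not common English and are long enough to be interesting
specialty_counts = collections.Counter(
    w for w in journal_words if w not in common and len(w) > 3)
print([w for w, _ in specialty_counts.most_common(50)])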
I've been searching for this for a long time, and most of the materials I've found were about named entity recognition. I'm running topic modeling, but my data contains too many people's names in the texts.
Is there a Python library that contains (English) names of people? If not, what would be a good way to remove people's names from each document in the corpus?
Here's a simple example:
texts=['Melissa\'s home was clean and spacious. I would love to visit again soon.','Kevin was nice and Kevin\'s home had a huge parking spaces.']
I would suggest using a tokenizer with some capability to recognize and differentiate proper nouns. spacy is quite versatile and its default tokenizer does a decent job of this.
There are hazards to using a list of names as if they're stop words - let me illustrate:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces.",
         "Bill sold a work of art to Art and gave him a bill"]

tokenList = []
for i, sentence in enumerate(texts):
    doc = nlp(sentence)
    for token in doc:
        tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])

tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")
So the first two sentences are easy, and in the resulting tokenDF spacy labels the proper nouns as "PROPN".
Now, the third sentence has been constructed to show the issue - lots of people have names that are also ordinary words. spacy's default tokenizer isn't perfect, but it does a respectable job with the two sides of the task: don't remove names when they are being used as regular words (e.g. bill of goods, work of art), and do identify them when they are being used as names. (You can see that it messed up one of the references to Art, the person.)
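Building on the tagging above, here is one minimal way to actually strip the detected names, a sketch that relies on spaCy's PERSON entity label rather than on a name list:
import spacy

nlp = spacy.load("en_core_web_sm")

def remove_person_names(text):
    # Rebuild the text, skipping any token spaCy labels as part of a PERSON entity.
    doc = nlp(text)
    return "".join(tok.text_with_ws for tok in doc if tok.ent_type_ != "PERSON")

texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces."]
print([remove_person_names(t) for t in texts])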
Not sure if this solution is efficient and robust but it's simple to understand (to me at the very least):
import re

# get a list of existing names (over 18,000) from the file
with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())

# your list of texts
texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces."]

# join the texts into one string
texts = ' | '.join(texts)

# find all the words that look like names
pattern = r"(\b[A-Z][a-z]+('s)?\b)"
found_names = re.findall(pattern, texts)

# strip the possessive 's and remove duplicates
found_names = set([name[0].replace("'s", "") for name in found_names])

# keep only the words that actually appear in NAMES
found_names = [name for name in found_names if name in NAMES]

# loop through the found names and remove each one from the texts
for name in found_names:
    texts = re.sub(name + "('s)?", "", texts)  # also matches possessive forms

# split the texts back into a list
texts = texts.split(' | ')

print(texts)
Output:
[' home was clean and spacious. I would love to visit again soon.',
' was nice and home had a huge parking spaces.']
List of the names was obtained here: https://www.usna.edu/Users/cs/roche/courses/s15si335/proj1/files.php%3Ff=names.txt.html
And I completely endorse the recommendation of #James_SO to use smarter tools.
simple example: func-tional --> functional
The story is that I got a Microsoft Word document converted from PDF format, and some words remain hyphenated (such as func-tional, broken because of a line break in the PDF). I want to recover those broken words while keeping the normal ones (i.e., where the "-" is not a line-break artifact).
To make it clearer, here is a longer example (source text):
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use regular expressions. This little script searches for hyphenated words and removes the hyphen.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find combinations of word-word
    sOut = s
    for m in matchList:
        new = m.replace("-", "")
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to regular expressions, I recommend this site:
https://regex101.com/
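If you also need to keep legitimately hyphenated words (the "normal ones" mentioned in the question), one possible refinement, sketched here under the assumption that NLTK's words corpus is available (nltk.download('words')), is to join the two halves only when the joined form is a real word:
import re
from nltk.corpus import words  # requires: nltk.download('words')

ENGLISH = {w.lower() for w in words.words()}

def repairLinebreakHyphens(s):
    # Join "func-tional" -> "functional" only if the joined form is a known word,
    # so genuine hyphenations such as "well-known" stay untouched.
    def fix(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in ENGLISH else match.group(0)
    return re.sub(r"(\w+)-(\w+)", fix, s)

print(repairLinebreakHyphens("the Func-tional Check Flight Compendium is well-known"))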
I'm trying to extract text from the online version of The Wealth of Nations and create a data frame where each observation is a page of the book. I do it in a roundabout way, imitating something similar I did in R, but I was wondering if there is a way to do this directly in BeautifulSoup.
What I do is first get the entire text from the page:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm')
soup = BeautifulSoup(r.text,'html.parser')
But from here on, I'm just working with regular expressions and the text. I find the beginning and end of the book text:
beginning = [a.start() for a in re.finditer(r"BOOK I\.",soup.text)]
beginning
end = [a.start() for a in re.finditer(r"FOOTNOTES",soup.text)]
book = soup.text[beginning[1]:end[0]]
Then I remove the carriage returns and new lines and split on strings of the form "[Pg digits]" and put everything into a pandas data frame.
book = book.replace('\r',' ').replace('\n',' ')
l = re.compile(r'\[[Pp]g\s?\d{1,3}\]').split(book)
df = pd.DataFrame(l,columns=['col1'])
df['page'] = range(2,df.shape[0]+2)
There are indicators in the HTML code for page numbers of the form <span class='pagenum'><a name="Page_vii" id="Page_vii">[Pg vii]</a></span>. Is there a way I can do the text extraction in BeautifulSoup by searching for text between these "spans"? I know how to search for the page markers using findall, but I was wondering how I can extract text between those markers.
To get the page markers and the text associated with them, you can use bs4 with re. In order to match text between two markers, itertools.groupby can be used:
from bs4 import BeautifulSoup as soup
import requests
import re
import itertools
page_data = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm').text
# for every <p>, store the page-number anchor it contains (if any) and its text
final_data = [(i.find('a', {'name': re.compile(r'Page_\w+')}), i.text) for i in soup(page_data, 'html.parser').find_all('p')]
# group consecutive paragraphs by whether they contain a page anchor, so the
# groups alternate between marker paragraphs and the text that follows them
new_data = [list(b) for a, b in itertools.groupby(final_data, key=lambda x: bool(x[0]))][1:]
# pair each marker group with the text group that follows it
final_data = {new_data[i][0][0].text: '\n'.join(c for _, c in new_data[i + 1]) for i in range(0, len(new_data), 2)}
Output (sample; the actual result is too long for the SO format):
{'[Pg vi]': "'In recompense for so many mortifying things, which nothing but truth\r\ncould have extorted from me, and which I could easily have multiplied to a\r\ngreater number, I doubt not but you are so good a christian as to return good\r\nfor evil, and to flatter my vanity, by telling me, that all the godly in Scotland\r\nabuse me for my account of John Knox and the reformation.'\nMr. Smith having completed, and given to the world his system of\r\nethics, that subject afterwards occupied but a small part of his lectures.\r\nHis attention was now chiefly directed to the illustration of\r\nthose other branches of science which he taught; and, accordingly, he\r\nseems to have taken up the resolution, even at that early period, of\r\npublishing an investigation into the principles of what he considered\r\nto be the only other branch of Moral Philosophy,—Jurisprudence, the\r\nsubject of which formed the third division of his lectures. At the\r\nconclusion of the Theory of Moral Sentiments, after treating of the\r\nimportance of a system of Natural Jurisprudence, and remarking that\r\nGrotius was the first, and perhaps the only writer, who had given any\r\nthing like a system of those principles which ought to run through,\r\nand be the foundation of the law of nations, Mr. Smith promised, in\r\nanother discourse, to give an account of the general principles of law\r\nand government, and of the different revolutions they have undergone\r\nin the different ages and periods of society, not only in what concerns\r\njustice, but in what concerns police, revenue, and arms, and whatever\r\nelse is the object of law.\nFour years after the publication of this work, and after a residence\r\nof thirteen years in Glasgow, Mr. Smith, in 1763, was induced to relinquish\r\nhis professorship, by an invitation from the Hon. Mr. Townsend,\r\nwho had married the Duchess of Buccleugh, to accompany the\r\nyoung Duke, her son, in his travels. Being indebted for this invitation\r\nto his own talents alone, it must have appeared peculiarly flattering\r\nto him. Such an appointment was, besides, the more acceptable,\r\nas it afforded him a better opportunity of becoming acquainted with\r\nthe internal policy of other states, and of completing that system of\r\npolitical economy, the principles of which he had previously delivered\r\nin his lectures, and which it was then the leading object of his studies\r\nto perfect.\nMr. Smith did not, however, resign his professorship till the day\r\nafter his arrival in Paris, in February 1764. He then addressed the\r\nfollowing letter to the Right Honourable Thomas Millar, lord advocate\r\nof Scotland, and then rector of the college of Glasgow:—", '[Pg vii]': "His lordship having transmitted the above to the professors, a meeting\r\nwas held; on which occasion the following honourable testimony\r\nof the sense they entertained of the worth of their former colleague\r\nwas entered in their minutes:—\n'The meeting accept of Dr. Smith's resignation in terms of the above letter;\r\nand the office of professor of moral philosophy in this university is therefore\r\nhereby declared to be vacant. The university at the same time, cannot\r\nhelp expressing their sincere regret at the removal of Dr. Smith, whose distinguished\r\nprobity and amiable qualities procured him the esteem and affection\r\nof his colleagues; whose uncommon genius, great abilities, and extensive\r\nlearning, did so much honour to this society. 
His elegant and ingenious\r\nTheory of Moral Sentiments having recommended him to the esteem of men\r\nof taste and literature throughout Europe, his happy talents in illustrating\r\nabstracted subjects, and faithful assiduity in communicating useful knowledge,\r\ndistinguished him as a professor, and at once afforded the greatest pleasure,\r\nand the most important instruction, to the youth under his care.'\nIn the first visit that Mr. Smith and his noble pupil made to Paris,\r\nthey only remained ten or twelve days; after which, they proceeded\r\nto Thoulouse, where, during a residence of eighteen months, Mr. Smith\r\nhad an opportunity of extending his information concerning the internal\r\npolicy of France, by the intimacy in which he lived with some of\r\nthe members of the parliament. After visiting several other places in\r\nthe south of France, and residing two months at Geneva, they returned\r\nabout Christmas to Paris. Here Mr. Smith ranked among his\r\nfriends many of the highest literary characters, among whom were\r\nseveral of the most distinguished of those political philosophers who\r\nwere denominated Economists."}
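A more bs4-direct variant of the same idea (a sketch; the pagenum class name comes from the span shown in the question) walks the document from each pagenum span to the next and collects the text in between:
from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.gutenberg.org/files/38194/38194-h/38194-h.htm').text
soup = BeautifulSoup(html, 'html.parser')

pages = {}
markers = soup.find_all('span', class_='pagenum')
for marker, nxt in zip(markers, markers[1:] + [None]):
    label = marker.get_text(strip=True)  # e.g. '[Pg vii]'
    chunks = []
    for el in marker.next_elements:      # document-order walk after the marker
        if el is nxt:                    # stop at the next page marker
            break
        # keep only text nodes, skipping the marker's own '[Pg ...]' label
        if isinstance(el, str) and marker not in el.parents:
            chunks.append(el)
    pages[label] = ' '.join(' '.join(chunks).split())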
I want to extract relevant information about a few topics, for example:
product information
purchase experience of customer
recommendation of family or friend
As a first step, I extract information from one of the websites. For instance:
i think AIA does a more better life insurance as my comparison and
the companies comparisonand most important is also medical insurance
in my opinionyes there are some agents that will sell u plans that
their commission is high...dun worry u buy insurance from a company
anything happens u can contact back the company also can ...better
find a agent that is reliable and not just working for the commission
for now , they might not service u in the future...thanksregardsdiana
""
Then, using NLTK in VS2015, I tried to split the text into words:
toks = nltk.word_tokenize(text)
Using pos_tag, I can tag my toks:
postoks = nltk.tag.pos_tag(toks)
From this point, I am not sure what I should do.
Previously, I used IBM Text Analytics. In that software I would create a dictionary, then create some patterns, and then analyse the data. For instance:
Sample of Dictionary: insurance_cmp : {AIA, IMG, SABB}
Sample of pattern:
insurance_cmp + Good_Feeling_Pattern
insurance_cmp + ['purchase|Buy'] + Bad_Feeling_Pattern
Good_Feeling_Pattern = [good, like it, nice]
Bad_Feeling_Pattern = [bad, worse, not good, regret]
I would like to know whether I can simulate the same thing in NLTK. Can a chunker and a grammar help me extract what I am looking for? May I have your ideas on how to improve this, please?
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)
Please help me figure out what my next step should be to reach my goal.
You just need to follow this video or read this blog.
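For completeness, here is a rough, assumption-heavy sketch of how the dictionary-plus-pattern idea could be approximated over NLTK tokens; the dictionaries and the proximity window are made up for illustration, and this is not an established NLTK recipe:
import nltk  # requires: nltk.download('punkt')

# hand-made dictionaries mirroring the IBM-style setup described in the question
INSURANCE_CMP = {"aia", "img", "sabb"}
GOOD_FEELING_PATTERN = {"good", "nice", "better", "like"}
BAD_FEELING_PATTERN = {"bad", "worse", "regret"}

def match_patterns(text, window=6):
    # Report (company, sentiment) pairs whenever a sentiment word occurs within
    # a few tokens of a company name; a crude stand-in for the pattern rules.
    toks = [t.lower() for t in nltk.word_tokenize(text)]
    hits = []
    for i, tok in enumerate(toks):
        if tok in INSURANCE_CMP:
            nearby = toks[max(0, i - window):i + window + 1]
            if any(w in GOOD_FEELING_PATTERN for w in nearby):
                hits.append((tok.upper(), "good"))
            if any(w in BAD_FEELING_PATTERN for w in nearby):
                hits.append((tok.upper(), "bad"))
    return hits

print(match_patterns("i think AIA does a more better life insurance as my comparison"))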