How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

I found this question, Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?, which explains how to feed text into TextTiling; however, I am unable to actually get back text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.
When I feed my text into texttiling, I get the same untokenized text back, but as a list, which is of no use to me.
import nltk

tt = nltk.tokenize.texttiling.TextTilingTokenizer(
    w=20, k=10, similarity_method=0, stopwords=None,
    smoothing_method=[0], smoothing_width=2, smoothing_rounds=1,
    cutoff_policy=1, demo_mode=False)
tiles = tt.tokenize(text)  # same text returned, just wrapped in a list
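(From a quick look at the source, TextTiling appears to treat only blank lines as paragraph breaks, which would explain why my single-newline string comes back unsplit; the regex below is my approximation of that check, not the exact NLTK internals.)

import re

# Only runs of two newlines (optionally with spaces/tabs between them)
# count as paragraph breaks; single "\n"s do not.
para_break = re.compile(r"[ \t]*\n[ \t]*\n[ \t]*")
print(para_break.split("single\nnewlines\nonly"))   # one paragraph
print(para_break.split("two\n\nparagraphs here"))   # two paragraphs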
What I have are emails that follow this basic structure
From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL
If we call this email string s, it would look like
s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.

What about using splitlines? Or do you have to use the nltk package?
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
y = [s.strip() for s in email.splitlines()]
print(y)

The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:
TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
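As an illustration of that intended use, here is a minimal sketch along the lines of the NLTK documentation demo (assuming the Brown corpus has been downloaded; its raw text contains the blank-line paragraph breaks that TextTiling segments on):

from nltk.corpus import brown
from nltk.tokenize.texttiling import TextTilingTokenizer

tt = TextTilingTokenizer(demo_mode=False)
text = brown.raw()[:4000]   # multi-paragraph expository text
tiles = tt.tokenize(text)   # a short list of multi-paragraph tiles
print(len(tiles))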
References:
{1} Marti A. Hearst. Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
{2} Lee, J.Y. and Dernoncourt, F., 2016. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 515-520. https://www.aclweb.org/anthology/N16-1062.pdf
{3} Dernoncourt, F., Lee, J.Y. and Szolovits, P., 2017. Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. https://www.aclweb.org/anthology/E17-2110.pdf
{4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
{5} Pevzner, L. and Hearst, M. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), pp. 19-36, March 2002.

How to label multi-word entities?

I'm quite new to data analysis (and Python in general), and I'm currently a bit stuck in my project.
For my NLP-task I need to create training data, i.e. find specific entities in sentences and label them. I have multiple csv files containing the entities I am trying to find, many of them consisting of multiple words. I have tokenized and lemmatized the unlabeled sentences with spaCy and loaded them into a pandas.DataFrame.
My main problem is: how do I now compare the tokenized sentences with the entity lists and label the (often multi-word) entities? With around 0.5 GB of sentences, I don't think it is feasible to simply loop over every sentence, and then over every entity in every class list, doing a substring search. Is there any smart way to use pandas.Series or DataFrame to do this labeling?
As mentioned, I don't really have any experience with pandas/numpy etc., and after a lot of web searching I still haven't managed to find an answer to my problem.
Say that this is a sample of finance.csv, one of my entity lists:
"Frontwave Credit Union",
"St. Mary's Bank",
"Center for Financial Services Innovation",
...
And that this is a sample of sport.csv, another one of my entity lists:
"Christiano Ronaldo",
"Lewis Hamilton",
...
And an example (dumb) sentence:
"Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"
The result I'd like would be something like a table of tokens with the matching entity labels (with IOB labeling):
"Dear "- O
"members" - O
"of" - O
"Frontwave" - B-FINANCE
"Credit" - I-FINANCE
"Union" - I-FINANCE
"," - O
"any" - O
...
"Lewis" - B-SPORT
"Hamilton" - I-SPORT
...
"said" - O
"Ronaldo" - O
Use:
import pandas as pd

FINANCE = ["Frontwave Credit Union",
           "St. Mary's Bank",
           "Center for Financial Services Innovation"]
SPORT = ["Christiano Ronaldo",
         "Lewis Hamilton"]

# Build one alternation pattern out of the entity list
FINANCE = '|'.join(FINANCE)

sent = pd.DataFrame({'sent': ["Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"]})

# Find every occurrence of a FINANCE entity in each sentence
home = sent['sent'].str.extractall(f'({FINANCE})')

def labeler(row, group):
    # First token of a match gets B-<group>, the rest I-<group>
    n = len(row.split())
    return [f'B-{group}' if i == 0 else f'I-{group}' for i in range(n)]

home[0].apply(labeler, group='FINANCE').explode()
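If you also need O labels and several entity classes at once, a minimal sketch of the same idea with plain re follows; label_tokens and ENTITY_LISTS are hypothetical names, whitespace tokenization stands in for spaCy's, and re.escape guards entity strings containing regex metacharacters (e.g. the period in "St. Mary's Bank"):

import re

ENTITY_LISTS = {
    'FINANCE': ["Frontwave Credit Union", "St. Mary's Bank",
                "Center for Financial Services Innovation"],
    'SPORT': ["Christiano Ronaldo", "Lewis Hamilton"],
}

def label_tokens(sentence):
    tokens = sentence.split()
    labels = ['O'] * len(tokens)
    for group, entities in ENTITY_LISTS.items():
        # re.escape protects metacharacters inside entity names
        pattern = '|'.join(re.escape(e) for e in entities)
        for match in re.finditer(pattern, sentence):
            # Token offset of the match = number of tokens before it
            start = len(sentence[:match.start()].split())
            width = len(match.group().split())
            for i in range(start, start + width):
                labels[i] = f'B-{group}' if i == start else f'I-{group}'
    return list(zip(tokens, labels))

print(label_tokens("Dear members of Frontwave Credit Union, any credit "
                   "demanded by Lewis Hamilton is invalid, said Ronaldo"))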

Why are WDT words being marked as a sentence subject by dependency parsing?

I want to report the subject of each sentence, and also extract all of its modifiers. (E.g. "Donald Trump", not just "Trump"; "(The) average remaining lease term", not just "term".)
Here is my test code:
import spacy
nlp = spacy.load('en_core_web_sm')
def handle(doc):
    for sent in doc.sents:
        shownSentence = False
        for token in sent:
            if token.dep_ == "nsubj":
                if not shownSentence:
                    print("----------")
                    print(sent)
                    shownSentence = True
                print("{0}/{1}".format(token.text, token.tag_))
                print([[t, t.tag_] for t in token.children])

handle(nlp('Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink." The average remaining lease term is six years, and Laura Palmer was killed by Bob. Trump added he will sell up soon.'))
The output is below. I'm wondering why I get "which/WDT" as a subject? Is it just model noise, or is it considered correct behaviour? (Incidentally, in my real sentence, which had the same structure, I also got "that/WDT" being marked as a subject.) (UPDATE: If I switch to 'en_core_web_md' then I do get "that/WDT" for my Trump example; that is the only difference switching from the small to the medium model makes.)
I can easily filter them out by looking at tag_; I'm more interested in the underlying reason.
(UPDATE: Incidentally, "Laura Palmer" doesn't get pulled out as a subject by this code, as the dep_ value is "nsubjpass", not "nsubj"; a small tweak for this is sketched below, after the output.)
----------
Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink."
Trump/NNP
[[Donald, 'NNP'], [,, ','], [legend, 'NN'], [,, ',']]
transaction/NN
[[This, 'DT']]
which/WDT
[]
----------
The average remaining lease term is six years, and Laura Palmer was killed by Bob.
term/NN
[[The, 'DT'], [average, 'JJ'], [remaining, 'JJ'], [lease, 'NN']]
----------
Trump added he will sell up soon.
Trump/NNP
[]
he/PRP
[]
(By the way, the bigger picture: pronoun resolution. I want to turn PRPs into the text they refer to.)
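Regarding the nsubjpass update above, a hedged tweak (is_subject is a hypothetical helper, not spaCy API) is to accept both dependency labels when testing for subjects:

SUBJECT_DEPS = ("nsubj", "nsubjpass")

def is_subject(token):
    # Treat active and passive subjects alike, so "Laura Palmer" in
    # "Laura Palmer was killed by Bob" is reported as a subject too
    return token.dep_ in SUBJECT_DEPS

Replacing the token.dep_ == "nsubj" test in handle() with is_subject(token) would then report passive subjects as well.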

Regex to Remove Name, Address, Designation from an email text Python

I have a sample text of an email like this. I want to keep only the body of the text and remove names, addresses, designations, company names, and email addresses from it. To be clear, I only want the content of each mail between the Dear/Hi/Hello greeting and the Sincerely/Regards/Thanks sign-off. How can I do this efficiently, using a regex or some other way?
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Dear Loren,
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
Best Regards,
Mr. Roger
Global Director
roger#abc.com
78 Ford st.
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
responding by June 15, 2018.check email for updates
Hello,
John Doe
Senior Director
john.doe#pqr.com
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Warm Regards,
Mr. Roger
Global Director
roger#abc.com
78 Ford st.
Center for Research
Office of New Discoveries
Food and Drug Administration
Loren#mno.com
From this text I only want as OUTPUT:
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Yes, an extension until June 22, 2018 is acceptable.
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
responding by June 15, 2018.check email for updates
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Below is an answer that works for your current input. The code will have to be adjusted when you process examples that fall outside the parameters it assumes.
import re

with open('email_input.txt') as input:
    # List to store the cleaned lines
    clean_lines = []
    # Reads until EOF
    lines = input.readlines()
    # Remove some of the extra lines
    no_new_lines = [i.strip() for i in lines]
    # Convert the input to all lowercase
    lowercase_lines = [i.lower() for i in no_new_lines]
    # Boolean state variable to keep track of whether we want to be printing lines or not
    lines_to_keep = False
    for line in lowercase_lines:
        # Look for lines that start with a subject line
        if line.startswith('subject: [external]'):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True
        # Look for lines that start with a salutation
        elif line.startswith("regards,") or line.startswith("warm regards,") \
                or line.startswith("best regards,") or line.startswith("hello,"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False
        if lines_to_keep:
            # regex to catch greeting lines
            greeting_component = re.compile(r'(dear.*,|(hi.*,))', re.IGNORECASE)
            remove_greeting = re.match(greeting_component, line)
            if not remove_greeting:
                if line not in clean_lines:
                    clean_lines.append(line)

for item in clean_lines:
    print(item)
# output
subject: [external] re: query regarding supplement 73
yes, an extension until june 22, 2018 is acceptable.
we had initial discussion with the abc team us know if you would be able to
extend the response due date to june 22, 2018.
responding by june 15, 2018.check email for updates
please refer to your january 12, 2018 data containing labeling supplements
to add text regarding this symptom. we are currently reviewing your
supplements and have made additional edits to your label.
feel free to contact me with any questions.
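A hedged single-regex alternative is to capture everything between a greeting line and a sign-off line in one pass per message; the alternations below mirror only the wording in this sample and would need adjusting for other inputs:

import re

body_re = re.compile(
    r'^(?:dear|hi|hello)\b[^\n]*\n'    # greeting line
    r'(.*?)'                           # body (captured, non-greedy)
    r'^(?:(?:warm|best)\s+)?(?:regards|sincerely|thanks),',  # sign-off line
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

with open('email_input.txt') as f:
    for body in body_re.findall(f.read()):
        print(body.strip())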

Geocode the address written in native language using English letters

Friends,
I am analyzing some texts. My requirement is to geocode addresses written in English letters but in a different native language.
Ex: chandpur market ke paas, village gorthaniya, UP, INDIA
In the above sentence, "ke paas" is a Hindi phrase (Hindi is an Indian national language) meaning "near" in English, and "chandpur market" is a proper noun (which can be ignored for conversion).
Now my challenge is to convert thousands of such words to English, identify the street name, and geocode it. Unfortunately, I do not have a postal code or exact address.
Can anyone please help here?
Thanks in advance!!
Google's geocoding API supports Hindi, so you don't necessarily have to translate to English. Here's an example using my googleway package (in R), specifying the language = "hi" argument.
You'll need an API key to use the Google API through googleway.
library(googleway)
set_key("your_api_key")
res <- google_geocode(address = "village gorthaniya, UP, INDIA",
                      language = "hi")
geocode_address(res)
# [1] "गोर्थानिया, उत्तर प्रदेश 272181, भारत"
geocode_coordinates(res)
# lat lng
# 1 26.85848 82.50099
geocode_address_components(res)
# long_name short_name types
# 1 गोर्थानिया गोर्थानिया locality, political
# 2 बस्ती बस्ती administrative_area_level_2, political
# 3 उत्तर प्रदेश उ॰ प्र॰ administrative_area_level_1, political
# 4 भारत IN country, political
# 5 272181 272181 postal_code
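If you would rather stay in Python, a rough equivalent is a sketch against the standard Google Geocoding REST endpoint using the requests library (you still need your own API key; the field names follow the documented JSON response):

import requests

API_KEY = "your_api_key"

def geocode(address, language="hi"):
    # Google Geocoding web service; language="hi" requests Hindi results
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {"address": address, "language": language, "key": API_KEY}
    results = requests.get(url, params=params).json()["results"]
    top = results[0]
    return top["formatted_address"], top["geometry"]["location"]

address, location = geocode("village gorthaniya, UP, INDIA")
print(address)    # the Hindi-script formatted address
print(location)   # {'lat': 26.85..., 'lng': 82.50...}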

BeautifulSoup: when pulling text from a section, <emph> and other tags are ignored causing adjacent words to be pushed together

I have an XML document. I want to pull all of the text between the <p> ... </p> tags. Below is an example of the text. The problem is that in a sentence like:
"Because the <emph>raspberry</emph> and.."
the output is "Because theraspberryand...". Somehow the emph tags are being dropped (which is good), but they are dropped in a way that pushes the adjacent words together.
Here is the relevant code I am using:
xml = BeautifulSoup(xml, convertEntities=BeautifulSoup.HTML_ENTITIES)
for para in xml.findAll('p'):
    text = text + " " + para.text + " "
Here is the start of part of the text, in case the full text helps:
<!DOCTYPE art SYSTEM "keton.dtd">
<art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355">
<fm>
<doctopic>Developmental Biology</doctopic>
<dochead>Inaugural Article</dochead>
<docsubj>Biological Sciences</docsubj>
<atl>Suspensor-derived polyembryony caused by altered expression of
valyl-tRNA synthetase in the <emph>twn2</emph>
mutant of <emph>Arabidopsis</emph></atl>
<prs>This contribution is part of the special series of Inaugural
Articles by members of the National Academy of Sciences elected on
April 30, 1996.</prs>
<aug>
<au><fnm>James Z.</fnm><snm>Zhang</snm></au>
<au><fnm>Chris R.</fnm><snm>Somerville</snm></au>
<fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington,
290 Panama Street, Stanford CA 94305</aff>
</fnr></aug>
<acc>May 9, 1997</acc>
<con>Chris R. Somerville</con>
<pubfront>
<cpyrt><date><year>1997</year></date>
<cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt>
<issn>0027-8424</issn><extent>7</extent><price>2.00/0</price>
</pubfront>
<fn id="FN150"><p>To whom reprint requests should be addressed. e-mail:
<email>crs#andrew.stanford.edu</email>.</p>
</fn>
<abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph>
exhibits a defect in early embryogenesis where, following one or two
divisions of the zygote, the decendents of the apical cell arrest. The
basal cells that normally give rise to the suspensor proliferate
abnormally, giving rise to multiple embryos. A high proportion of the
seeds fail to develop viable embryos, and those that do, contain a high
proportion of partially or completely duplicated embryos. The adult
plants are smaller and less vigorous than the wild type and have a
severely stunted root. The <emph>twn2-1</emph> mutation, which is the
only known allele, was caused by a T-DNA insertion in the 5′
untranslated region of a putative valyl-tRNA synthetase gene,
<it>valRS</it>. The insertion causes reduced transcription of the
<it>valRS</it> gene in reproductive tissues and developing seeds but
increased expression in leaves. Analysis of transcript initiation sites
and the expression of promoter–reporter fusions in transgenic plants
indicated that enhancer elements inside the first two introns interact
with the border of the T-DNA to cause the altered pattern of expression
of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The
phenotypic consequences of this unique mutation are interpreted in the
context of a model, suggested by Vernon and Meinke &lsqbVernon, D. M. &
Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&rsqb, in
which the apical cell and its decendents normally suppress the
embryogenic potential of the basal cell and its decendents during early
embryo development.</p>
</abs>
</fm>
I think the problem here is that you're trying to write bs4 code with bs3.
The obvious fix is to use bs4 instead.
But in bs3, the docs show two ways to get all of the text recursively from all contents of a soup:
''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))
You can obviously change either one of those to explicitly strip whitespace off the edges and add exactly one space between each node instead of relying on whatever space might be there:
' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
' '.join(s.strip() for s in soup.findAll(text=True))
I wouldn't want to guarantee that this will be exactly the same as the bs4 text property… but I think it's what you want here.
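For completeness, bs4's get_text handles this directly via its separator argument; a minimal sketch (xml here is the document string from the question; the "xml" parser assumes lxml is installed, otherwise "html.parser" also copes with this input):

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml, "xml")
text = " ".join(
    p.get_text(separator=" ", strip=True)   # one space where each tag sat
    for p in soup.find_all("p")
)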
