"You could not possibly have come at a better time, my dear Watson,"
he said cordially. 'It is not worth your while to wait,' she went
on."You can pass through the door; no one hinders." And then, seeing that I smiled and shook my head, she suddenly threw aside her
constraint and made a step forward, with her hands wrung together.
Look at the highlighted area. How can I possibly distinguish a case where a '"' is followed by a period (.) to end a sentence from a case where a period (.) is followed by a '"'?
I have tried this pattern for the tokenizer. It works well except for just that one part.
(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)
Edit: I am not planning to use any NLP toolkit to solve this problem.
Use NLTK instead of regular expressions here:
from nltk import sent_tokenize
parts = sent_tokenize(your_string)
# ['"You could not possibly have come at a better time, my dear Watson," he said cordially.', "'It is not worth your while to wait,' she went on.", '"You can pass through the door; no one hinders."', 'And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.']
Found this function a while ago:

import re

def split_into_sentences(text):
    caps = u"([A-Z])"
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    starters = u"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = u"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = u"[.](com|net|org|io|gov|mobi|info|edu)"
    # the original Python 2 version decoded byte strings here:
    #   if not isinstance(text, unicode): text = text.decode('utf-8')
    text = u" {0} ".format(text)
    text = text.replace(u"\n", u" ")
    # protect periods that do not end a sentence by rewriting them as <prd>
    text = re.sub(prefixes, u"\\1<prd>", text)
    text = re.sub(websites, u"<prd>\\1", text)
    if u"Ph.D" in text:
        text = text.replace(u"Ph.D.", u"Ph<prd>D<prd>")
    text = re.sub(u"\s" + caps + u"[.] ", u" \\1<prd> ", text)
    text = re.sub(acronyms + u" " + starters, u"\\1<stop> \\2", text)
    text = re.sub(caps + u"[.]" + caps + u"[.]" + caps + u"[.]", u"\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(caps + u"[.]" + caps + u"[.]", u"\\1<prd>\\2<prd>", text)
    text = re.sub(u" " + suffixes + u"[.] " + starters, u" \\1<stop> \\2", text)
    text = re.sub(u" " + suffixes + u"[.]", u" \\1<prd>", text)
    text = re.sub(u" " + caps + u"[.]", u" \\1<prd>", text)
    # swap terminators with adjacent quotes so the <stop> marker,
    # inserted after the terminator, lands outside the quote
    if u"\"" in text:
        text = text.replace(u".\"", u"\".")
    if u"!" in text:
        text = text.replace(u"!\"", u"\"!")
    if u"?" in text:
        text = text.replace(u"?\"", u"\"?")
    text = text.replace(u".", u".<stop>")
    text = text.replace(u"?", u"?<stop>")
    text = text.replace(u"!", u"!<stop>")
    text = text.replace(u"<prd>", u".")  # restore the protected periods
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
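For example, running it on the passage above prints one sentence per line; the quote-swapping replace calls are what make the ." endings break in the right place:

sample = ('"You could not possibly have come at a better time, my dear Watson," '
          "he said cordially. 'It is not worth your while to wait,' she went on."
          '"You can pass through the door; no one hinders."')
for s in split_into_sentences(sample):
    print(s)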
I wrote a Python script in VS Code that lets me run text searches within an EPUB book. These searches match the book's text against regular expressions that come from patterns I formulated for the tags in my library. I have already collected over 400 tags this way and keep them in a custom column, adding the # symbol at the beginning to distinguish them from tags downloaded from other sources. I have 3000+ books and I want to run all 400+ regular expressions against each of them.
I need help because my code only handles searching a single book, and what I want to set up is:
**Run the code on selected books from my library (books_ids).**
**Add the found tags to the metadata.**
**Add a verification tag confirming that the book was processed.**
Code:
import re
import ast
from epub_conversion.utils import open_book, convert_epub_to_lines
import colorama

colorama.init()

# Contents of test_dict.txt:
# {'#publication_(*history)': r'\bpublication\b[^.]*\bhistory',
#  '#horror_fiction': r'\bhorror fiction',
#  '#story_(*writer)': r'\bstory\b[^.]*\bwriter',
#  '#published_(*books)': r'\bpublished\b[^.]*\bbooks',
#  '#books_(*poems)': r'\bbook[^.]*\bpoem',
#  '#new_discovery': r'\bnew discovery',
#  '#weird_tales': r'\bweird tales',
#  '#literary_(*importance)': r'\bliterary[^.]*\bimportance',
#  }

book = open_book("Cthulhu Mythos.epub")
lines = convert_epub_to_lines(book)

with open("test_dict.txt", "r") as data:
    tags_dict = ast.literal_eval(data.read())

print(colorama.Back.YELLOW + 'Matches(regex - book text):', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(value, line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                regex = re.compile(value)
                match_array = regex.finditer(line)
                match_list = list(match_array)
                for m in match_list:
                    print(colorama.Fore.MAGENTA + key, ":", colorama.Style.RESET_ALL + m.group())

print('\n', colorama.Back.YELLOW + 'Found tags:', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(value, line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                print(colorama.Fore.GREEN + key, end=", ")

print('\n\n' + colorama.Back.YELLOW + "N° found tags:", colorama.Style.RESET_ALL, len(temp))
This prints:
Matches(regex - book text):
#story_(*writer) : story writer
#publication_(*history) : publication in the history
#horror_fiction : horror fiction
#published_(*books) : published three books
#books_(*poems) : books of poems professionally, and had even sold a couple of prose poem
#new_discovery : new discovery
#literary_(*importance) : literary outpouring of prodigious wordage and importance
Found tags:
#story_(*writer), #publication_(*history), #horror_fiction, #published_(*books), #books_(*poems), #new_discovery, #literary_(*importance),
N° found tags: 7
https://i.stack.imgur.com/iBE1f.png
The truth is that my knowledge of Python is very limited; I only learned about regular expressions thanks to Calibre.
I'd appreciate any help with the code, thank you very much.
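A minimal sketch of the batch loop being asked for, assuming the single-book regex pass is factored into a helper; find_tags, EPUB_DIR, and PROCESSED_TAG are hypothetical names, and the Calibre metadata update itself is left as a placeholder:

import ast
import glob
import os
import re
from epub_conversion.utils import open_book, convert_epub_to_lines

EPUB_DIR = "library"          # hypothetical folder holding the selected books
PROCESSED_TAG = "#processed"  # hypothetical verification tag

def find_tags(epub_path, tags_dict):
    # run every tag regex over one book; return the set of matching tag names
    lines = convert_epub_to_lines(open_book(epub_path))
    found = set()
    for line in lines:
        for tag, pattern in tags_dict.items():
            if tag not in found and re.search(pattern, line):
                found.add(tag)
    return found

with open("test_dict.txt", "r") as data:
    tags_dict = ast.literal_eval(data.read())

for path in glob.glob(os.path.join(EPUB_DIR, "*.epub")):
    tags = find_tags(path, tags_dict)
    tags.add(PROCESSED_TAG)
    # writing the tags back into the Calibre metadata would go here,
    # e.g. via calibredb set_metadata or a Calibre plugin
    print(path, "->", ", ".join(sorted(tags)))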
I need to insert text into thousands of Word documents, and I found python-docx suitable for this purpose. But the string I have to enter has font color and style, like:
{\color{Red}Mr Mike} Bold Class, Italics College, City
Is it possible to format that?
As an MWE (minimal working example):
from docx import Document

Name = ["Name1", "Name2", "Name3"]
Class = ["Class1", "Class2", "Class3"]

for i in range(len(Name)):
    document = Document('doc.docx')  # reopen the template so each output starts fresh
    string = Name[i] + ", " + Class[i] + " Class"
    # Name[i] should be red, bold, 12 pt; Class[i] should be 12 pt, italics
    for p in document.paragraphs:
        if p.text.find("REPLACE") >= 0:
            p.text = p.text.replace("REPLACE", string)
    document.save(Name[i] + '.docx')
Character formatting is applied at the run level. So you will need to work at that level if you want "inline" differences in character formatting. In general, a paragraph is composed of zero or more runs. All characters in a run share the same character-level formatting ("font"). Pseudo-code for a possible approach is:
old = "foo"
new = "bar"
text = paragraph.text
start = text.find(old)
prefix = text[:start]
suffix = text[start+len(old):]
# --- assigning to `paragraph.text` leaves the paragraph with
# --- exactly one run
paragraph.text = prefix
# --- add an additional run for your replacement word, keeping
# --- a reference to it so you can set its visual attributes
run = paragraph.add_run(new)
run.italic = True
# --- etc. to get the formatting you want ---
# --- add the suffix to complete the original paragraph text ---
paragraph.add_run(suffix)
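Applied to the strings in the question, a sketch of that approach could look like the following; the red/bold/12 pt and italic/12 pt choices come from the question, Pt and RGBColor are python-docx helpers from docx.shared, and the placeholder "REPLACE" and the sample values are illustrative:

from docx import Document
from docx.shared import Pt, RGBColor

document = Document('doc.docx')
for paragraph in document.paragraphs:
    if "REPLACE" not in paragraph.text:
        continue
    prefix, _, suffix = paragraph.text.partition("REPLACE")
    paragraph.text = prefix                        # collapses the paragraph to one run
    name_run = paragraph.add_run("Name1")          # red, bold, 12 pt
    name_run.bold = True
    name_run.font.size = Pt(12)
    name_run.font.color.rgb = RGBColor(0xFF, 0x00, 0x00)
    paragraph.add_run(", ")
    class_run = paragraph.add_run("Class1 Class")  # italic, 12 pt
    class_run.italic = True
    class_run.font.size = Pt(12)
    paragraph.add_run(suffix)                      # restore the rest of the paragraph
document.save('Name1.docx')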
I have some functions that successfully replace specific strings from a .docx file (in both paragraphs and tables) with new strings.
Everything with the replacing part of the function works great. However, the "spacing after paragraph" values change when I replace the strings in the paragraph. I want the spacing to stay the same, or I want to be able to change the spacing back to its original format after the replacement occurs.
I am unfamiliar with python-docx, so maybe it's just a simple solution that I missed?
A sample of the word .docx file can be downloaded here: https://filebin.net/9t5v96tb5y7z0e60
The working paragraph and table string replacement code:
"""
Script to replace specific texts within word doc templates
"""
import re, docx
from docx import Document
def clear_paragraph(self, paragraph):
p_element = paragraph._p
p_child_elements = [elm for elm in p_element.iterchildren()]
for child_element in p_child_elements:
p_element.remove(child_element)
def paragraph_replace(self, search, replace):
searchre = re.compile(search)
for paragraph in self.paragraphs:
paragraph_text = paragraph.text
if paragraph_text:
if searchre.search(paragraph_text):
clear_paragraph(self, paragraph)
paragraph.add_run(re.sub(search, replace, paragraph_text))
return paragraph
def table_replace(self, text_value, replace):
result = False
tbl_regex = re.compile(text_value)
for table in self.tables:
for row in table.rows:
for cell in row.cells:
if cell.text:
if tbl_regex.search(cell.text):
cell.text = replace
result = True
return result
regex1 = ["<<authors>>", "<<author>>", "<<id>>", \
"<<title>>", "<<date>>", "<<discipline>>", \
"<<countries>>"]
author = "Robert"
authors = "Robert, John; Bob, Billy; Duck, Donald"
ms_id = "2020-34-2321"
title = "blah blah blah and one more blah"
date = "31-03-2020"
discipline = "BE"
countries = "United States, Japan, China, South Africa"
replace1 = [authors, author, ms_id, title, date, discipline, countries]
filename = "Sample Template.docx"
doc = Document(filename)
for x in range(len(regex1)):
paragraph_replace(doc, regex1[x], replace1[x])
table_replace(doc, regex1[x], replace1[x])
doc.save(author + '_updated.docx')
After reading more of the python-docx documentation and some testing, the solution to this problem turned out to be easy.
from docx.shared import Pt  # Pt is needed for the font and spacing sizes

def clear_paragraph(self, paragraph):
    p_element = paragraph._p
    p_child_elements = [elm for elm in p_element.iterchildren()]
    for child_element in p_child_elements:
        p_element.remove(child_element)

def paragraph_replace(self, search, replace, x):
    searchre = re.compile(search)
    for paragraph in self.paragraphs:
        paragraph_text = paragraph.text
        if paragraph_text:
            if searchre.search(paragraph_text):
                clear_paragraph(self, paragraph)
                para = paragraph.add_run(re.sub(search, replace, paragraph_text))
                para.font.size = Pt(10)
                paragraph.paragraph_format.space_after = Pt(0)
                if x == 2:  # `x is 2` only works by accident of CPython int caching
                    para.bold = True
                else:
                    para.bold = False
                paragraph.paragraph_format.line_spacing = 1.0
    return paragraph

def table_replace(self, text_value, replace):
    result = False
    tbl_regex = re.compile(text_value)
    for table in self.tables:
        for row in table.rows:
            for cell in row.cells:
                paragraphs = cell.paragraphs
                for paragraph in paragraphs:
                    for run in paragraph.runs:
                        font = run.font
                        font.size = Pt(10)
                if cell.text:
                    if tbl_regex.search(cell.text):
                        cell.text = replace
                        result = True
    return result
I added the x argument in the paragraph_replace function because I wanted the first line of my document to be bold. All my issues are now resolved with these simple additions to the code.
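A different route that sidesteps the formatting loss entirely, as a rough sketch: rather than clearing the paragraph and re-adding a single run, rewrite the run texts in place so every run keeps its own formatting. This only works when the placeholder does not span run boundaries, and paragraph_replace_inplace is a hypothetical helper, not part of the code above:

def paragraph_replace_inplace(paragraph, search, replace):
    # edit each run's text separately; run formatting and the paragraph's
    # spacing settings are never touched, only the characters change
    for run in paragraph.runs:
        if search in run.text:
            run.text = run.text.replace(search, replace)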
I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n on to the next line...
^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
5311 def checkSubIndent(s,l,t):
5312 curCol = col(l,s)
5313 if curCol > indentStack[-1]:
5314 indentStack.append( curCol )
5315 else:
-> 5316 raise ParseException(s,l,"not a subentry")
5317
ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented, will not work in my case, as it will consume the subsequent definitions.
When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a built-in expression for "everything from here to the end of the line", called restOfLine, so we will just use that as the expression for each line in the indented block:
import pyparsing as pp

NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]

# see corrected version below
# description = pp.Group(pp.Empty()
#                        + pp.restOfLine + NL
#                        + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack)))

description = pp.Group(pp.restOfLine + NL
                       + pp.Optional(pp.ungroup(~pp.StringEnd()
                                                + pp.indentedBlock(pp.restOfLine,
                                                                   indent_stack))))

labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock, but we also need to remove the per-line nesting that indentedBlock creates internally. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
At this point, we are pretty much done. Here is your sample text parsed and dumped:
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or use this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah
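If the end goal is a plain mapping from each identifier to its full description text, the parse results above collapse easily (a small follow-on sketch):

result = {item.label: "\n".join(item.description) for item in parsed_data}
print(result['next_identifier'])  # more description, short this time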
I have an input text as follows:
SAVE_TIMECARD = "insert into sh_user_timecard (instance_id, user_id, in_time, in_time_activity_log_aid, in_time_activity_log_instance_id, " +"out_time, out_time_activity_log_aid, out_time_activity_log_instance_id, parent_aid, parent_instance_id)" + " values (:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId)";
The output I need is:
SAVE_TIMECARD = :instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, :outTimeActivityInstanceId, :parentAid, :parentInstanceId
I've tried achieving this using:
result = re.findall(r'[A-z]+(:?=)',inputfile)
I need to extract the upper-case word, that is SAVE_TIMECARD, and all the words that start with a colon.
I found the solution:

import re

regex = re.compile(r"^[^=]*|:(\w+)")
testString = ('private static final String SAVE_TIMECARD = "insert into sh_user_timecard (instance_id, user_id, '
              'in_time, in_time_activity_log_aid, in_time_activity_log_instance_id, " +"out_time, '
              'out_time_activity_log_aid, out_time_activity_log_instance_id, parent_aid, parent_instance_id)" '
              '+ " values (:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, '
              ':outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId)";')
matchArray = regex.findall(testString)

The matchArray variable contains the list of matches.
:\w+
Will identify the 'words starting with a colon'. You'll need to loop through the original text to find all instances.
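For example, a minimal sketch combining both extractions (the variable names are illustrative):

import re

text = 'SAVE_TIMECARD = "insert into ... values (:instanceId, :userId, :inTime)";'
name = re.search(r'\b[A-Z][A-Z_]+\b', text).group()  # the upper-case identifier
params = re.findall(r':\w+', text)                   # every word starting with a colon
print(name, params)  # SAVE_TIMECARD [':instanceId', ':userId', ':inTime']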