Extracting text ending with escape characters in Python

I am trying to parse the key details of PDF papers with Python and extract the title of the paper, the authors, and their email addresses:
from PyPDF2 import PdfReader
reader = PdfReader("paper.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
This returns the raw text of the PDF:
'Title\nGoes\nHere\nAuthor Name (sdsd#mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
I have a function which removes the newlines and tabs etc
def remove_newlines_tabs(text):
    """
    Remove all occurrences of newlines, tabs, and the literal sequences \\n and \\.
    arguments:
        text: "text" of type "String".
    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.
    Example:
    Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
    Output : This is her first day at this place. Please, Be nice to her.
    """
    # Replace every occurrence of \n, \\n, \t, and \\ with a space.
    formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t', ' ').replace('\\', ' ').replace('. com', '.com')
    return formatted_text
which returns
'Title Goes Here Author Name (sdsd#mail.net) University of Teeyab September 6, 2022 Some text in the Document. '
which makes it easy to extract the email. How can I extract the Title of the PDF and the authors? The title is the most important thing but I am not sure of the best approach...

Here's a solution using regex, based on the following assumptions:
every word of the title is separated by a newline character \n
the parts of the author's name are separated by whitespace
the email address is always wrapped in parentheses ()
import re
test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd#mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
# \w matches letters, digits, and underscore
# \s matches any whitespace character: space, \t, \n, \r, \f, \v
# first, let's extract string that appears before parentheses
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>
# clean up leading and trailing whitespaces using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']
# join the words of title as a single string
title = " ".join(title_author[:-1])
author = title_author[-1]
print(title) # Title Goes Here
print(author) # Author Name
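The steps above can be folded into one helper that also pulls out the email. This is only a sketch under the same three assumptions; the function name parse_header and the returned dictionary keys are made up for illustration and are not part of the original answer:

```python
import re

def parse_header(raw_text):
    """Return title, author and email from raw PDF text (hypothetical helper)."""
    # Group 1: everything made of word characters/whitespace before "(".
    # Group 2: whatever sits inside the first pair of parentheses.
    match = re.search(r"([\w\s]+)\(([^)]*)\)", raw_text)
    if match is None:
        return None  # the layout assumptions do not hold for this text
    lines = match.group(1).strip().split("\n")
    return {
        "title": " ".join(lines[:-1]),  # every line except the last one
        "author": lines[-1],            # the last line before the parentheses
        "email": match.group(2),
    }

sample = 'Title\nGoes\nHere\nAuthor Name (sdsd#mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
info = parse_header(sample)
print(info)
```

Returning None when nothing matches makes it easier to spot papers whose first page does not follow the assumed layout.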

Related

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
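The asker's target output also lowercases the text and drops punctuation. A minimal sketch that combines the question's re.sub() cleanup with the split/strip idea above; the function name normalize is made up for illustration:

```python
import re

def normalize(line):
    """Lowercase, drop punctuation, keep apostrophes inside words (hypothetical helper)."""
    # Replace every run of characters that are not a-z, space or ' with a space.
    line = re.sub(r"[^a-z' ]+", " ", line.lower())
    # Strip quotes from both ends of each word; skip words that become empty.
    words = (w.strip("'") for w in line.split())
    return " ".join(w for w in words if w)

cleaned = normalize("Here is some stuff. 'Now there are quotes.' Now there's not.")
print(cleaned)
```

The empty-word filter matters because removing punctuation first can leave a lone ' as its own token.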
Using regular expressions, you could first remove every ' that doesn't follow a letter, then remove every ' that doesn't precede a letter (thus only keeping the ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
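It is probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results for a token like 'this'. (a closing quote followed by a period): strip() leaves the second ' in place because it is not the last character of the word. If you want to remove that second ' too, the regular expression method is probably the simplest option.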
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
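To illustrate the whitespace claim, here is a small comparison of the lookaround pattern with the split/join approach on input that contains a double space. The names OUTER_QUOTE and strip_outer_quotes are hypothetical:

```python
import re

# Same lookaround idea as above: a quote with no word character on one side.
OUTER_QUOTE = re.compile(r"(?<!\w)'|'(?!\w)")

def strip_outer_quotes(line):
    """Remove quotes that are not enclosed by word characters (hypothetical helper)."""
    return OUTER_QUOTE.sub("", line)

original = "Here is  some stuff. 'Now there are quotes.' Now there's not."
cleaned = strip_outer_quotes(original)                       # keeps the double space
rejoined = " ".join(w.strip("'") for w in original.split())  # collapses it
print(repr(cleaned))
print(repr(rejoined))
```

Which behaviour is preferable depends on whether the original spacing carries meaning in your data.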
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is removed unless it is both preceded and followed by a word character (letter, digit, or _).
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is removed unless it is both preceded and followed by a letter (ASCII only in the first pattern, any Unicode letter in the second).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    return any(x.isalnum() for x in text)
def stripit(text, ofwhat):
    # Strip each unwanted character in turn from both ends of the word
    for x in ofwhat:
        text = text.strip(x)
    return text
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space and drops leading and trailing whitespace.

Replace substrings with items from list

Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.
If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first occurrence of that character in text.
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s(?=\s)', i, text, count=1)
The call to re.sub() returns a new string. The pattern says "find a whitespace character that is immediately followed by another whitespace character", and only that first whitespace character is replaced with the punctuation character. The final argument (a count of 1) limits the substitution to the first match; by default re.sub() replaces all of them.
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
import re
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s\s', i + " ", text, count=1)
This yields the same output.
There will be a leftover whitespace character at the end of the sentence, but if that bothers you, a simple text.rstrip() takes care of it.
Further explanation
Your first attempt with the regex ' +' doesn't work because it matches every run of one or more spaces, including the single spaces between words, so each of those runs gets replaced with a punctuation character. The solutions above only match a double whitespace.
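A quick way to see the difference is to run both patterns on the sample text. This is only an illustration; the variable names greedy and exact are made up:

```python
import re

text = "Some text  Why is there no punctuation  "

# " +" matches every run of one or more spaces, single spaces included.
greedy = re.sub(r" +", ".", text)

# " {2}" matches exactly two consecutive spaces and nothing shorter.
exact = re.sub(r" {2}", ".", text)

print(greedy)  # Some.text.Why.is.there.no.punctuation.
print(exact)   # Some text.Why is there no punctuation.
```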
You can do it simply using the replace method!
text = "Some text  Why is there no punctuation  "
puncts = ['.', '?']
for i in puncts:
    text = text.replace("  ", i, 1) # notice the 1 here
print(text)
Output : Some text.Why is there no punctuation?
You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text  Why is there no punctuation  "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?
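Another option, sketched here as an alternative to the loops above, is a single re.sub() call with a replacement function that pulls the next mark from an iterator. The helper name fill_punctuation is hypothetical, and the fallback for an exhausted list (a plain space) is an assumption about what you would want:

```python
import re

def fill_punctuation(text, puncts):
    """Replace each double space with the next mark from puncts (hypothetical helper)."""
    marks = iter(puncts)

    def next_mark(match):
        # Once the list is used up, fall back to a single space.
        return next(marks, "") + " "

    # Exactly two spaces per match; strip the space left after the last mark.
    return re.sub(r" {2}", next_mark, text).rstrip()

result = fill_punctuation("Some text  Why is there no punctuation  ", [".", "?"])
print(result)  # Some text. Why is there no punctuation?
```

This walks the string once, instead of rescanning it from the start for every punctuation mark.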

get full string before and after a specific pattern

I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to remove every token in this sentence (that is, everything between one space and the next) that contains &#.
result = "this is some text and some more text and some other stuff"
been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches one or more whitespace characters, then \S* matches zero or more non-whitespace characters, followed by &#, then \S* again matches zero or more non-whitespace characters, and finally \s+ matches one or more whitespace characters. The whole match is replaced by a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
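If regex is not a requirement, the same result can be sketched with plain string methods: split on whitespace and keep only the tokens without the marker. This swaps the regex for a token filter; the function name drop_noise_tokens and its marker parameter are made up for illustration:

```python
def drop_noise_tokens(text, marker="&#"):
    """Drop every whitespace-delimited token that contains marker (hypothetical helper)."""
    return " ".join(tok for tok in text.split() if marker not in tok)

text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
cleaned = drop_noise_tokens(text)
print(cleaned)
```

Unlike the \s+ pattern, this also handles noise tokens at the very start or end of the string, at the cost of collapsing any repeated whitespace.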
Try this:
import re
result = re.findall(r"[a-zA-Z]+&#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the entries in result from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+&#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff' (note the doubled spaces left behind)
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+&#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'

Convert a string to a dictionary using regex-grouping

I have a number of txt files in a format like this -
\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text
I'd like to get these into a dictionary that looks like this -
{'Intro': '\n text \n text \n',
'Body': '\n text \n text',
'Refs': '\n test \n text'}
I'm concerned about the time it is going to take to process all of the txt files so wanted an approach that would take as little time as possible and I don't care about splitting the text into lines.
I am trying to use regex, but am struggling to get it to work correctly - I think my last regex group is incorrect. Below is what I currently have. Any suggestions would be great.
pattern = r"(====.)(.+?\b)(.*)"
matches = re.findall(pattern, data, re.DOTALL)
my_dict = {b:c for a,b,c in matches}
You don’t need RegEx here, instead you can use classic split() function.
Here, I use textwrap for readability:
import textwrap
text = textwrap.dedent("""\
    ==== Intro
     text
     text
    ==== Body
     text
     text
    ==== Refs
     test
     text""")
You can do:
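result = {}
for part in text.split("==== "):
    if part.strip(): # skip the empty chunk before the first marker
        # split on the first run of whitespace: header name, then content
        section, content = part.split(None, 1)
        result[section] = content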
Or initialise a dict with a list of tuples in comprehension:
result = dict(part.split(None, 1)
              for part in text.split("==== ")
              if part.strip())
This should do:
d = dict(re.findall(r'(?<=\n====\s)(\w+)(\s+[^=]+)', text, re.M | re.DOTALL))
print(d)
{'Body': ' \n text \n text \n',
'Intro': ' \n text \n text \n',
'Refs': ' \n test \n text'}
Regex Details
(?<= # lookbehind (must be fixed width)
\n # newline
==== # four '=' chars in succession
\s # single wsp character
)
( # first capture group
\w+ # 1 or more word characters (letters, digits, or underscore)
)
( # second capture group
\s+ # one or more wsp characters
[^=]+ # match any char that is not an '='
)
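One limitation of [^=]+ is that a section body containing an = sign gets cut short. A possible alternative, sketched here rather than taken from the answer, is re.split() with a capturing group, which returns the header names interleaved with the bodies. The function name parse_sections is hypothetical:

```python
import re

def parse_sections(data):
    """Map each '==== Name' header to the text that follows it (hypothetical helper)."""
    # With a capturing group, re.split() keeps the captured header names:
    # [text before first header, name1, body1, name2, body2, ...]
    parts = re.split(r"^==== (\w+)", data, flags=re.M)
    return dict(zip(parts[1::2], parts[2::2]))

data = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
sections = parse_sections(data)
print(sections)
```

Because the pattern is anchored with ^ under re.M, only a marker at the start of a line opens a new section.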
You can try this:
import re
s = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
final_data = re.findall(r"(?<=\n\=\=\=\=\s)[a-zA-Z]+\s", s)
text = re.findall(r"\n .*? \n .*?$|\n .*? \n .*? \n", s)
final_body = {a:b for a, b in zip(final_data, text)}
Output:
{'Body ': '\n text \n text \n', 'Intro ': '\n text \n text \n', 'Refs ': '\n test \n text'}
If you do not want to read the whole file into memory, you can process it line-by-line like this:
marker = "==== "
def read_my_custom_format(file):
    current_header = None
    current_contents = []
    for line in file:
        line = line.strip() # trim whitespace, including trailing newline
        if line.startswith(marker):
            if current_header is not None:
                yield current_header, current_contents # emit finished section
            current_header = line[len(marker):] # trim marker
            current_contents = []
        else:
            current_contents.append(line)
    if current_header is not None:
        yield current_header, current_contents # emit the last section
This is a generator yielding tuples instead of building a dictionary.
This way it only holds one section at a time in memory.
Also, each key maps to a list of lines instead of one string, but you could easily join them with "\n".join(lines).
If you want to produce a single dictionary, which again takes memory proportional to the input file, you can just do it like this:
with open("your_textfile.txt") as file:
data = dict(read_my_custom_format(file))
Because dict() can take an iterable of 2-tuples

Replace all text between 2 strings python

Lets say I have:
a = r''' Example
This is a very annoying string
that takes up multiple lines
and h#s a// kind{s} of stupid symbols in it
ok String'''
I need a way to replace (or just delete) any text in between "This" and "ok" so that when I call it, a now equals:
a = "Example String"
I can't find any wildcards that seem to work. Any help is much appreciated.
You need Regular Expression:
>>> import re
>>> re.sub('\nThis.*?ok','',a, flags=re.DOTALL)
' Example String'
Another method is to use string splits:
def replaceTextBetween(originalText, delimiterA, delimiterB, replacementText):
    leadingText = originalText.split(delimiterA)[0]
    trailingText = originalText.split(delimiterB)[1]
    return leadingText + delimiterA + replacementText + delimiterB + trailingText
Limitations:
Does not check if the delimiters exist
Assumes that there are no duplicate delimiters
Assumes that delimiters are in correct order
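The first limitation can be lifted with str.partition(), which reports whether the delimiter was found. A minimal sketch, assuming that returning the text unchanged is acceptable when either delimiter is missing; the name replace_between is hypothetical:

```python
def replace_between(text, start, end, replacement=""):
    """Replace the text between the first start and the next end (hypothetical helper)."""
    head, found_start, rest = text.partition(start)
    if not found_start:
        return text  # start delimiter missing: leave the text untouched
    _, found_end, tail = rest.partition(end)
    if not found_end:
        return text  # end delimiter missing after start: leave untouched
    # Like the function above, both delimiters are kept in the result.
    return head + start + replacement + end + tail

print(replace_between("keep [old] keep", "[", "]", "new"))  # keep [new] keep
```

Searching for the end delimiter only in the part after the start delimiter also covers the ordering problem.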
The DOTALL flag is the key. Ordinarily, the '.' character doesn't match newlines, so you don't match across lines in a string. If you set the DOTALL flag, re will match '.*' across as many lines as it needs to.
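The effect of the flag is easy to check by running the same pattern with and without it. The sample string below is a shortened stand-in for the question's text, and the variable names are made up:

```python
import re

a = " Example\nThis is a very annoying string\nthat takes up multiple lines\nok String"

# Without DOTALL, '.' stops at the first newline, so "ok" is never reached.
without_flag = re.sub(r"\nThis.*?ok", "", a)

# With DOTALL, '.' also matches newlines and the match spans all three lines.
with_flag = re.sub(r"\nThis.*?ok", "", a, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```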
Use re.sub: it replaces everything from one string through another (both delimiters included) with the desired replacement.
format: re.sub('A(.*?)B', P, Q, flags=re.DOTALL)
where
A : character or symbol or string that starts the match (pass it through re.escape() if it contains regex metacharacters)
B : character or symbol or string that ends the match (same caveat)
P : character or symbol or string that replaces everything from A through B
Q : input string
re.DOTALL : lets '.' match newlines, so the match can span lines
import re
re.sub('\nThis(.*?)ok', '', a, flags=re.DOTALL)
output : ' Example String'
Let's see an example with HTML code as input:
input_string = '''<body> <h1>Heading</h1> <p>Paragraph</p><b>bold text</b></body>'''
Target : remove <p> tag
re.sub('<p>(.*?)</p>', '', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> <b>bold text</b></body>'
Target : replace <p> tag with word : test
re.sub('<p>(.*?)</p>', 'test', input_string, flags=re.DOTALL)
output : '<body> <h1>Heading</h1> test<b>bold text</b></body>'
a=re.sub('This.*ok','',a,flags=re.DOTALL)
If you want first and last words:
re.sub(r'^\s*(\w+).*?(\w+)$', r'\1 \2', a, flags=re.DOTALL)
