I have a number of txt files in a format like this -
\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text
I'd like to get these into a dictionary that looks like this -
{'Intro': '\n text \n text \n',
'Body': '\n text \n text',
'Refs': '\n test \n text'}
I'm concerned about the time it is going to take to process all of the txt files so wanted an approach that would take as little time as possible and I don't care about splitting the text into lines.
I am trying to use regex, but am struggling to get it to work correctly - I think my last regex group is incorrect. Below is what I currently have. Any suggestions would be great.
pattern = r"(====.)(.+?\b)(.*)"
matches = re.findall(pattern, data, re.DOTALL)
my_dict = {b:c for a,b,c in matches}
You don't need a regex here; the plain str.split() method is enough.
Here, I use textwrap for readability:
import textwrap
text = textwrap.dedent("""\
==== Intro
text
text
==== Body
text
text
==== Refs
test
text""")
You can do:
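result = {}
for part in text.split("==== "):
    if part.strip():  # skip the empty chunk before the first marker
        # split on the first run of whitespace: section name, then its content
        section, content = part.split(None, 1)
        result[section] = content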
Or build the dict from a generator expression that yields [section, content] pairs:
result = dict(part.split(None, 1)
              for part in text.split("==== ")
              if part.strip())
This should do:
import re
# text is the raw string from the question, which starts with a newline
d = dict(re.findall(r'(?<=\n====\s)(\w+)(\s+[^=]+)', text, re.M | re.DOTALL))
print(d)
{'Body': ' \n text \n text \n',
'Intro': ' \n text \n text \n',
'Refs': ' \n test \n text'}
Regex Details
(?<= # lookbehind (must be fixed width)
\n # newline
==== # four '=' chars in succession
\s # single wsp character
)
( # first capture group
\w+ # 1 or more word characters (letters, digits, or underscore)
)
( # second capture group
\s+ # one or more wsp characters
[^=]+ # one or more characters that are not '='
)
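Because the breakdown above is already written as a commented pattern, it can be compiled almost verbatim with re.VERBOSE. This is only a minimal sketch: SECTION_RE and sections are names made up for illustration, and data is the sample string from the question.

```python
import re

# The commented pattern, compiled as-is. In VERBOSE mode, whitespace and
# "# ..." comments inside the pattern are ignored.
SECTION_RE = re.compile(r"""
    (?<=\n====\s)   # lookbehind: newline, four '=' characters, one whitespace character
    (\w+)           # group 1: the section name
    (\s+[^=]+)      # group 2: whitespace, then everything up to the next '='
""", re.VERBOSE)

data = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"

# findall returns (name, body) tuples, which dict() accepts directly.
sections = dict(SECTION_RE.findall(data))
print(sections)
```

One caveat of this pattern: [^=]+ stops at the first '=' it meets, so a section body that itself contains an '=' would be cut short.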
You can try this:
import re
s = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
final_data = re.findall(r"(?<=\n====\s)[a-zA-Z]+\s", s)
text = re.findall(r"\n .*? \n .*?$|\n .*? \n .*? \n", s)
final_body = {a:b for a, b in zip(final_data, text)}
Output:
{'Body ': '\n text \n text \n', 'Intro ': '\n text \n text \n', 'Refs ': '\n test \n text'}
If you do not want to read the whole file into memory, you can process it line-by-line like this:
marker = "==== "

def read_my_custom_format(file):
    current_header = None
    current_contents = []
    for line in file:
        line = line.strip()  # trim whitespace, including trailing newline
        if line.startswith(marker):
            if current_header is not None:
                yield current_header, current_contents  # emit finished section
            current_header = line[len(marker):]  # trim marker
            current_contents = []
        else:
            current_contents.append(line)
    if current_header is not None:
        yield current_header, current_contents  # emit the last section
This is a generator that yields (header, lines) tuples instead of building a dictionary.
That way it only holds one section in memory at a time.
Each header is paired with a list of lines rather than one string, but you can easily combine them with "\n".join(lines).
If you want to produce a single dictionary, which again takes memory proportional to the input file, you can just do it like this:
with open("your_textfile.txt") as file:
    data = dict(read_my_custom_format(file))
This works because dict() accepts an iterable of 2-tuples.
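For a quick check without a file on disk, the same line-by-line idea can be tried on an in-memory stream. The sketch below is a self-contained variant, not the code above: iter_sections, MARKER and sample are illustrative names, and it joins each section into one string, since the question does not need the text split into lines.

```python
import io

MARKER = "==== "

def iter_sections(lines):
    """Yield (header, body) pairs, holding only one section at a time."""
    header, body = None, []
    for line in lines:
        if line.startswith(MARKER):
            if header is not None:
                yield header, "".join(body)   # emit the finished section
            header, body = line[len(MARKER):].strip(), []
        elif header is not None:
            body.append(line)                 # keep the line's own newline
    if header is not None:
        yield header, "".join(body)           # emit the last section

# io.StringIO stands in for an open text file here.
sample = io.StringIO(
    "==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
)
sections = dict(iter_sections(sample))
print(sections)
```

Lines are left unstripped so that joining them with "" reproduces the original section text.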
I am trying to parse the key details of PDF papers via python, and extract the title of the paper, authors and their email
from PyPDF2 import PdfReader
reader = PdfReader("paper.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
returns the raw text of the PDF
'Title\nGoes\nHere\nAuthor Name (sdsd#mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
I have a function which removes the newlines and tabs etc
def remove_newlines_tabs(text):
    """
    Remove all occurrences of newlines, tabs, and the literal sequences \\n and \\.

    arguments:
        text: the input, of type str.
    return:
        the text after replacing newlines, tabs, \\n and \\ with spaces.
    Example:
        Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
        Output : This is her first day at this place. Please, Be nice to her.
    """
    # Replace every \n, \\n, \t and \\ with a space, then repair '. com' to '.com'.
    formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t', ' ').replace('\\', ' ').replace('. com', '.com')
    return formatted_text
which returns
'Title Goes Here Author Name (sdsd#mail.net) University of Teeyab September 6, 2022 Some text in the Document. '
which makes it easy to extract the email. How can I extract the Title of the PDF and the authors? The title is the most important thing but I am not sure of the best approach...
Here's a solution using regex, based on the following assumptions:
every word of title is separated by a newline character \n
every word of author is separated by a whitespace
email address is always wrapped by parentheses ()
import re
test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd#mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
# \w matches letters, digits, and underscore
# \s matches whitespace characters (space, \t, \n, \r, \f, \v)
# first, let's extract string that appears before parentheses
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>
# clean up leading and trailing whitespaces using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']
# join the words of title as a single string
title = " ".join(title_author[:-1])
author = title_author[-1]
print(title) # Title Goes Here
print(author) # Author Name
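The same three assumptions also let a single pattern capture the email address. The following is a hedged sketch, not part of the answer above: parse_front_matter is a hypothetical helper name, and it relies on the title/author block being the first thing in the extracted text.

```python
import re

def parse_front_matter(raw):
    """Split 'Title\\nWords\\nAuthor Name (email)...' into its three parts."""
    # group 1: everything made of word characters and whitespace before "("
    # group 2: whatever sits between the parentheses
    match = re.match(r"([\w\s]+)\(([^)]*)\)", raw)
    if match is None:
        return None
    # the last newline-separated chunk is the author, the rest is the title
    *title_words, author = match.group(1).strip().split("\n")
    return {
        "title": " ".join(title_words),
        "author": author,
        "email": match.group(2),
    }

test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd#mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
info = parse_front_matter(test_string)
print(info)
```

If the title contains punctuation (a colon or hyphen, say), [\w\s]+ stops early and the match fails, so real papers will likely need a more forgiving pattern.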
I need data to train a bot, so I have scraped SO questions. How can I replace new lines without removing \n from strings?
If I have the following string:
"""You can use \n to print a new line.
Text text text."""
How can I get: You can use \n to print a new line. Text text text.
I've tried this: string.replace("\n","")
But I end up with: 'You can use to print a new line.Text text text.'
Since I'm dealing with programming questions, I'm destined to run into \n in a string and wouldn't want to replace that.
You could write it as a raw string,
which is done by prefixing the literal with the letter r.
example 1:
print(r"You can use \n to print a new line.")
# You can use \n to print a new line.
The raw prefix keeps \n as a literal backslash followed by n, so it shows up in the output as you want.
example 2:
text = r"You can use \n to print a new line."
print(text)
# You can use \n to print a new line.
If you are printing the string and the output is:
You can use \n to print a new line.
Text text text.
then the \n visible in the output is actually the backslash character followed by the letter n, and not a newline character.
Doing replace("\n", "") should not remove the sequence of characters \n, because the replace pattern "\n" itself is not the sequence of characters \n, but rather the actual single newline character. So it does not match the \n sequence of characters visible in your string, but it does match (and replace) the newline characters.
This REPL snippet illustrates that:
>>> x = """You can use \\n to print a new line.\n\nText text text.""" # this string literal is how you would create the string you have shown in you question.
>>> x == r"""You can use \n to print a new line.
...
... Text text text.""" # or you can use a raw string literal to initialize your variable, it is exactly the same thing
True
>>> print(x)
You can use \n to print a new line.
Text text text.
>>> print(x.replace("\n", ""))
You can use \n to print a new line.Text text text.
If you mean that you are creating a string with the literal:
"""You can use \n to print a new line.
Text text text."""
Then it is impossible to distinguish between the typed \n and the result of pressing the Enter key in your string literal (unless you use a raw string initializer, as other answers have explained). Once the code is interpreted by Python they are identical. Consider escaping the newline character in your literal to have it included in your string as is:
myString = """You can use \\n to print a new line.
Text text text."""
If you want to convert newlines to the literal two-character sequence \n, escape the backslash in the replacement:
string.replace("\n","\\n")
The \n in your string is an escape sequence that gets evaluated to the newline character.
In [1]: s = """You can use \n to print a new line.
...:
...: Text text text."""
In [2]: print(s)
You can use
to print a new line.
Text text text.
If you want to actually include the characters \ and n in your string, you need to escape the backslash with another backslash.
In [3]: s = """You can use \\n to print a new line.
...:
...: Text text text."""
In [4]: print(s)
You can use \n to print a new line.
Text text text.
In [5]: print(s.replace("\n", ""))
You can use \n to print a new line.Text text text.
Alternatively, you could use a "raw string", i.e. a string prefixed with r, e.g. r"..." or r"""...""" but then you would no longer be able to use escape sequences such as \n to insert a newline character, \t to insert a tab, etc.
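To tie the answers together, here is a small sketch of the distinction in code. The variable names are illustrative, and the string is built the way scraped text would normally arrive, with the visible \n stored as a backslash followed by n.

```python
# "\\n" is two characters (backslash, n); "\n" is one newline character.
s = "You can use \\n to print a new line.\nText text text."

assert len("\\n") == 2 and len("\n") == 1
assert r"\n" == "\\n"   # a raw literal spells the same two characters

# Replacing the newline character leaves the literal backslash-n untouched.
flattened = s.replace("\n", " ")
print(flattened)  # You can use \n to print a new line. Text text text.
```

Replacing with a space rather than an empty string keeps the sentences from running together.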
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re

with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
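The whitespace claim is easiest to see side by side with the split/strip approach. This is a minimal sketch; EDGE_QUOTE and the sample string are made up for illustration.

```python
import re

# A quote is removed when it is not preceded, or not followed, by a word character.
EDGE_QUOTE = re.compile(r"(?<!\w)'|'(?!\w)")

s = "'here  is 'quoted'   text, there's more'"

# Lookaround version: only the quotes go, the spacing is untouched.
with_regex = EDGE_QUOTE.sub("", s)

# split/strip version: quotes go, and runs of spaces collapse to one.
with_split = " ".join(word.strip("'") for word in s.split())

print(repr(with_regex))  # "here  is quoted   text, there's more"
print(repr(with_split))  # "here is quoted text, there's more"
```

Which one to prefer depends on whether the original spacing matters downstream.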
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is matched unless it sits between two word characters (letters, digits, or _).
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is matched unless it sits between two letters (ASCII only in the first pattern, any Unicode letter in the second).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    ok = 0
    for x in text: ok += x.isalnum()
    return ok > 0

def stripit(text, ofwhat):
    # strip each unwanted character from both ends, one character at a time
    for x in ofwhat: text = text.strip(x)
    return text

def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)

>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this collapses every run of whitespace within a line into a single space and drops words that contain no letters or digits.
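A shorter variant of the same idea is possible because str.strip() accepts a whole set of characters at once. The sketch below is an alternative, not the code above: purge_words is a hypothetical name, and it assumes string.punctuation is an acceptable definition of "unwanted" characters.

```python
import string

def purge_words(text, unwanted=string.punctuation):
    """Strip unwanted characters from both ends of every word."""
    # str.strip(chars) removes any of the given characters from both ends,
    # so quotes and punctuation inside a word (as in "Isn't") survive.
    stripped = (word.strip(unwanted) for word in text.split())
    # words made up entirely of unwanted characters become empty and are dropped
    return " ".join(word for word in stripped if word)

cleaned = purge_words("'Nice, .to, see! you. Isn't it?'")
print(cleaned)  # Nice to see you Isn't it
```

Unlike purge above, this version works on the whole text at once, so line breaks are collapsed into spaces as well.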
I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to be able to remove everything in this sentence where after a space, and before a space contains &#.
result = "this is some text and some more text and some other stuff"
been trying:
re.compile(r'([\s]&#.*?([\s])).sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, the first \s+ matches one or more whitespace characters, \S*&#\S* matches the run of non-whitespace characters that contains &#, and the final \s+ matches the whitespace after it. The whole match is replaced by a single space, giving you your intended string.
Also, if this noise string can be either at the very start or very end of string, feel free to change \s+ to \s*
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
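If the noise can also sit at the very start or end of the string, one option is to replace just the token and normalise the whitespace afterwards. This is a sketch under that assumption; NOISE and remove_noise are illustrative names.

```python
import re

# Any run of non-whitespace characters that contains "&#".
NOISE = re.compile(r"\S*&#\S*")

def remove_noise(text):
    # Replace each noisy token with a space, then collapse the leftover
    # runs of whitespace (split() also trims both ends).
    return " ".join(NOISE.sub(" ", text).split())

text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
print(remove_noise(text))
# this is some text and some more text and some other stuff

print(remove_noise("a&#b this is x&#y text &#"))
# this is text
```

The trade-off is that every run of whitespace in the input is collapsed, not only the ones next to removed tokens.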
Try this:
import re
result = re.findall(r"[a-zA-Z]+&#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the entries of result from the list of all words.
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+&#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff' (the removed tokens leave doubled spaces behind)
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+&#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'
I have some strings, and at a particular index I want to check whether the next character is a digit surrounded by one or more whitespace characters.
For example
here is a string
'some data \n 8 \n more data'
Let's say I am iterating over the string and currently standing at index 8. At that position I want to know whether the next character is a digit, and only a digit, ignoring all the whitespace before and after it.
So, for the above particular case it should tell me True, and for a string like the one below
'some data \n (8 \n more data'
it should tell me False
I tried the pattern below
r'\s*[0-9]+\s*'
but it doesn't work for me; maybe I am using it incorrectly.
Your original regex didn't work because "*" means "zero or more matches", so the surrounding whitespace was optional. Instead, you should use "+", which means "one or more matches". See below:
>>> import re
>>> s = 'some data \n 8 \n more data'
>>> if re.search(r"\s+[0-9]+\s+", s): print(True)
...
True
>>> s = 'some data \n 8) \n more data'
>>> if re.search(r"\s+[0-9]+\s+", s): print(True)
...
>>> s = 'some data \n 8343 \n more data'
>>> if re.search(r"\s+[0-9]+\s+", s): print(True)
...
True
If you just want to capture a single digit surrounded by one or more whitespace characters, remove the "+" after "[0-9]" like this:
re.search(r"\s+[0-9]\s+", s)
Try this:
(?<=\s)[0-9]+(?=\s)
This regex uses a look-ahead and a look-behind, such that it matches the number only when the characters before and after it are whitespace characters.
In verbose form:
(?<=\s) # match if whitespace before
[0-9]+ # match digits
(?=\s) # match if whitespace after
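The question asks about a specific position in the string, whereas re.search scans the whole string. A compiled pattern's match() method takes a start position, so the check can be anchored where the iteration currently stands. This is a minimal sketch: digit_follows and DIGITS_IN_WHITESPACE are made-up names, and "next char" is read as index + 1.

```python
import re

# Whitespace, then digits, then (without consuming it) more whitespace.
DIGITS_IN_WHITESPACE = re.compile(r"\s+[0-9]+(?=\s)")

def digit_follows(s, index):
    """True if, right after s[index], there is whitespace-wrapped digits."""
    # Pattern.match(string, pos) only succeeds if the match starts exactly at pos.
    return DIGITS_IN_WHITESPACE.match(s, index + 1) is not None

s1 = 'some data \n 8 \n more data'
s2 = 'some data \n (8 \n more data'

print(digit_follows(s1, 8))  # True
print(digit_follows(s2, 8))  # False
```

To accept only a single digit, drop the "+" after [0-9] in the pattern.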
Without regex:
s1 = 'some data \n 8 \n more data'
s2 = 'some data \n (8 \n more data'

def test_string(s):
    # True when the second line holds exactly one digit and nothing else
    middle = s.splitlines()[1].strip()
    return len(middle) == 1 and middle.isdigit()
print(test_string(s1))
print(test_string(s2))
Output:
True
False