Python3 and regex: how to remove lines of numbers? - python

I have a long text file converted from a PDF and I want to remove instances of certain things, e.g. page numbers that appear on a line by themselves but possibly surrounded by spaces. I made a regex that works on short strings, e.g.
news1 = 'Hello done.\n4\nNext paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news1)
print(m)
Hello done. Next paragraph.
But when I try this on more complex strings, it fails, e.g.
news = '1 \n Hello done. \n 4 \n 44 \n Next paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news)
print(m)
1
Hello done. 44
Next paragraph.
How do I make this work across the entire file? Should I instead read line by line and deal with it per line, instead of trying to edit the whole string?
I've also tried using periods to match arbitrary characters, but that doesn't catch the initial '1' in the more complex string. So I guess I could do 2 regexes.
m = re.sub('. *[0-9] *.', '', news)
1
Hello done.
Next paragraph.
Thoughts?

I would recommend doing it line by line unless you have some specific reason to slurp it all in as a string. Then just a few regexes to clean it all up like:
#not sure how the pages are numbered, but perhaps...
text = re.sub(r"^\s*\d+\s*$", "", text)
#chuck a line in to strip out stuff in all caps of at least 3 letters
text = re.sub(r"[A-Z]{3,}", "", text)
#concatenate multiple whitespace to 1 space, handy to clean up the data
text = re.sub(r"\s+", " ", text)
#trim the start and end of the line
text = text.strip()
Just one strategy, but that's the route I would go with; it's easy to maintain down the road as your business side comes up with "OH OH! Can you also replace any mention of 'Cat' with 'Dog'?" I think it's easier to troubleshoot/log your changes as well. Maybe even try using re.subn to track changes...?
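For what it's worth, here is a minimal line-by-line sketch of that strategy, run against the OP's failing example string (the exact patterns are only a starting point, not a definitive cleanup):
import re

news = '1 \n Hello done. \n 4 \n 44 \n Next paragraph.'

cleaned_lines = []
for line in news.splitlines():
    line = re.sub(r"^\s*\d+\s*$", "", line)   # blank out lines that are only a page number
    line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace and trim the line
    if line:                                  # drop lines that are now empty
        cleaned_lines.append(line)

print(" ".join(cleaned_lines))
# Hello done. Next paragraph.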

Related

Want to fetch the spacing between the words line by line from a PDF using python

I want to implement code that can perform one simple task: fetch the spacing between the words, line by line. The user input should be a PDF, from which the lines should be recognized by the code. The PDF can contain different kinds of spacing and patterns.
There is the usage of isspace() in Python, but I don't think that would work in this scenario. Any kind of help would be very much appreciated.
Generally this will not be easy, as there is no single answer. Look at this page saved as a PDF: the gap between letters is not a fixed value; that is called kerning.
Each font glyph is in effect standalone, so the last letter of one word can sit at any distance from the start of the next word. Usually the font metrics are needed: with a non-proportional font whose letters are one inch wide, the letters would fall at one-inch intervals, but a word space needs to be a little more than one inch. And then again, it may be kerned to a different value. With kerning, justification and oblique type, the spacing takes on much more complex values, so you will often see odd-looking spaces.
Basically, every word space can be different on every line of every page, unlike here in HTML.
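If you do want to measure the gaps yourself, here is a minimal sketch; it assumes the third-party pdfplumber library and a placeholder file name sample.pdf (neither comes from this thread) and simply compares the bounding boxes the library reports for consecutive words on a line:
import pdfplumber  # third-party library; an assumption, not used elsewhere in this thread

with pdfplumber.open("sample.pdf") as pdf:           # "sample.pdf" is a placeholder name
    for page in pdf.pages:
        words = page.extract_words()                 # each word comes with its bounding box
        for left, right in zip(words, words[1:]):
            if abs(left["top"] - right["top"]) < 2:  # only compare words on roughly the same line
                gap = right["x0"] - left["x1"]       # horizontal gap in points
                print(f'{left["text"]!r} -> {right["text"]!r}: {gap:.1f} pt')
As the answer above explains, expect these gaps to vary from pair to pair because of kerning and justification.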
So, after a week my friend and I came up with something that gets the job done, though not in a perfect way. In case anyone finds this problem interesting, I'm sharing the code. Open to any suggestions. Thank you.
import re
import pdftotext
from glob import glob

st = glob('Tampered.pdf')
for i in st:
    with open(i, "rb") as f:                  # open each matched PDF
        pdf = pdftotext.PDF(f)
    ls = []; text = ""
    for j in range(len(pdf)):
        ls.append(pdf[j])                     # collect the text of every page
    text = text.join(ls)
    text = re.sub('Page [0-9]*', '', text)              # drop "Page n" markers
    text = re.sub(r'(\r\n)+|\r+|\n+|\t+', '', text)      # strip line breaks and tabs
    text = re.sub('TAMPERED.*', '', text)
    # text = re.sub(' +', '', text)
    text = text.strip()

def Spaces(input_list):
    s = [i for i in input_list if i != '']
    s_1 = s[:]
    s = [input_list.index(s[j]) for j in range(len(s))]  # positions of the non-empty tokens
    print('Spaces between :- ')
    for i in range(len(s)):
        if i + 1 < len(s):
            print(f"\t'{s_1[i]}' and '{s_1[i+1]}' : {s[i+1] - s[i]}")

input_list = text.split(" ")
Spaces(input_list)

Remove numbers from result; Python3

I'm creating a script that querys websites, and my results end up looking something like this
result = """
nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43
"""
Basically, I want to remove the numbers that come after each line of text. That would be easy for me to do in a pattern, but there is the occasional long string that takes up two separate lines. There's also the matter of needing to keep the numbers in the actual text names.
I was thinking of checking each individual line for string length and just removing those w/o 5 or more letters / numbers, but I wasn't sure if that would work, and I wasn't too sure how to do it either.
Any help from you guys would be great.
Thanks! :)
You could maybe use regex matching, looking for a link-like string (allowing for newlines) followed by a number and a newline, which you'd want to ignore. Then, to accommodate multi-line links, use simple str.replace() to remove any occurrences of the consistent ...\n that occurs when the link is split across multiple lines.
What I have in mind, given the example you've provided, is this:
import re
result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""
matches = re.findall(r'([A-Za-z0-9\n/_.-]+?)[0-9\n]+[\n\b]', result, flags=re.M)
# ([A-Za-z0-9\n/_.-]+?)  match this group: the shortest possible run of at
#                        least one of these characters (\n is included to
#                        allow multi-line string input, hence flags=re.M)
# [0-9\n]+               then at least one digit or newline
# [\n\b]                 and ending with \n or end-of-string
# matches = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlin...\nk']
links = [link.replace('...\n', '') for link in matches]
# links = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlink']
I'm not sure what your links look like, but I assumed [A-Za-z0-9/_.-] (alphanumerics plus /, _, ., and -) covers all the standard parts of hyperlinks. And \n needs to be thrown in there somewhere to accommodate multi-line entries. You can modify this character class depending on what you expect your links to look like.

Python - how to separate paragraphs from text?

I need to separate texts into paragraphs and be able to work with each of them. How can I do that? Between every two paragraphs there can be at least one empty line. Like this:
Hello world,
this is an example.

Let´s program something.

Creating new program.

Thanks in advance.
This should work:
text.split('\n\n')
Try
result = list(filter(lambda x : x != '', text.split('\n\n')))
Not an entirely trivial problem, and the standard library doesn't seem to have any ready solutions.
Paragraphs in your example are split by at least two newlines, which unfortunately makes text.split("\n\n") invalid. I think that instead, splitting by regular expressions is a workable strategy:
import fileinput
import re

NEWLINES_RE = re.compile(r"\n{2,}")  # two or more "\n" characters

def split_paragraphs(input_text=""):
    no_newlines = input_text.strip("\n")  # remove leading and trailing "\n"
    split_text = NEWLINES_RE.split(no_newlines)  # regex splitting

    paragraphs = [p + "\n" for p in split_text if p.strip()]
    # p + "\n" ensures that all lines in the paragraph end with a newline
    # p.strip() == True if paragraph has other characters than whitespace

    return paragraphs

# sample code, to split all script input files into paragraphs
text = "".join(fileinput.input())
for paragraph in split_paragraphs(text):
    print(f"<<{paragraph}>>\n")
Edited to add:
It is probably cleaner to use a state machine approach. Here's a fairly simple example using a generator function, which has the added benefit of streaming through the input one line at a time, and not storing complete copies of the input in memory:
import fileinput

def split_paragraph2(input_lines):
    paragraph = []  # store current paragraph as a list
    for line in input_lines:
        if line.strip():  # True if line is non-empty (apart from whitespace)
            paragraph.append(line)
        elif paragraph:  # If we see an empty line, return paragraph (if any)
            yield "".join(paragraph)
            paragraph = []
    if paragraph:  # After end of input, return final paragraph (if any)
        yield "".join(paragraph)

# sample code, to split all script input files into paragraphs
for paragraph in split_paragraph2(fileinput.input()):
    print(f"<<{paragraph}>>\n")
I usually split then filter out the '' and strip. ;)
a =\
'''
Hello world,
this is an example.
Let´s program something.
Creating new program.
'''
data = [content.strip() for content in a.splitlines() if content]
print(data)
This worked for me:
text = "".join(text.splitlines())
text.split('something that is almost always used to separate sentences (i.e. a period, question mark, etc.)')
Easier. I had the same problem.
Just replace the double newline \n\n with a character that you seldom see in the text (here ¾):
a ='''
Hello world,
this is an example.
Let´s program something.
Creating new program.'''
a = a.replace("\n\n" , "¾")
splitted_text = a.split('¾')
print(splitted_text)

How to find substring in a targeted string more accurately in python?

I know 'in' can find a substring in another string, as in [How to determine whether a substring is in a different string].
But I could not figure out how to find exactly the substring I want in the example below:
text = '"Peter,just say hello world." Mary said "En..."'
I want to judge whether 'Peter' is in text but not inside the quoted "XXXX" content. If I use
if 'Peter' in text:
    print 'yes'
else:
    print 'no'
But the result is 'yes', which is wrong, because 'Peter' only appears inside the quoted "XXXX" content.
Besides solving this problem, I also want to get the "XXXX" content itself. For example, 'Mary' is in text and not inside the "XXXX" content, and I also want to get "Peter,just say hello world.".
To meet your special requirement, I think it is a good exercise to process the text character by character; it is good training for string-processing skills. For this problem you can use a stack to record the double quotation marks, so that you can tell whether a given character sits inside double quotes.
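A minimal sketch of that character-by-character idea (split_quoted is just an illustrative helper; a single flag stands in for the stack, since double quotes cannot nest, and it assumes the quotes are balanced):
def split_quoted(text):
    quoted, unquoted = [], []
    buf = []
    in_quotes = False
    for ch in text:
        if ch == '"':                      # toggle state at every double quote
            (quoted if in_quotes else unquoted).append(''.join(buf))
            buf = []
            in_quotes = not in_quotes
        else:
            buf.append(ch)
    unquoted.append(''.join(buf))          # whatever is left is outside quotes
    return quoted, unquoted

text = '"Peter,just say hello world." Mary said "En..."'
quoted, unquoted = split_quoted(text)
# quoted   -> ['Peter,just say hello world.', 'En...']
# unquoted -> ['', ' Mary said ', '']
print('Peter' in ' '.join(unquoted))  # False
print('Mary' in ' '.join(unquoted))   # True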
Like many string processing problems, regular expressions are your friend. One way to handle this problem is to start at the front of the string and incrementally process it.
Check the start of the string to see whether it's unquoted or quoted text. If it's unquoted, pull all the unquoted text off until you hit a quote. If it's quoted text, pull off everything until you hit an end quote. Keep processing the text until all the text has been processed and categorized as either quoted or unquoted.
You'll then have two separate lists of quoted and unquoted text strings. You can then do string inclusion checks in either list.
import re

text = '"Peter,just say hello world." Mary said "En..."'

unquoted_text = []
quoted_text = []

while text:
    # Pull unquoted text off the front
    m = re.match(r'^([^"]+)(.*)$', text)
    if m:
        unquoted_text.append(m.group(1))
        text = m.group(2)

    # Pull quoted text off the front
    m = re.match(r'^"([^"]*)"(.*)$', text)
    if m:
        quoted_text.append(m.group(1))
        text = m.group(2)

    # Just in case there is a single unmatched double quote (bad!)
    # Categorize as unquoted
    m = re.match(r'^"([^"]*)$', text)
    if m:
        unquoted_text.append(m.group(1))
        text = ''

print 'UNQUOTED'
print unquoted_text
print 'QUOTED'
print quoted_text

is_peter_in_quotes = any(['Peter' in t for t in quoted_text])

stopword removal using python

All,
I have some text that I need to clean up and I have a little algorithm that "mostly" works.
def removeStopwords(self, data):
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
    charList = list(data)
    cat = ''.join(char for char in charList if not char in wordList).split()
    return ' '.join(cat)
Take the first line on this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters that we are not interested in, which in this case are all the non-alphanumeric chars.
A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.
The output looks pretty good except that some of the words are recombined incorrectly and I am unsure how to correct it.
A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit
Note the word "selfcontained" was "self-contained".
EDIT: Contents of the stopwords file, which is just a bunch of chars:
!
$
%
^
,
&
*
(
)
{
}
[
]
<
,
.
/
|
\
?
~
`
:
;
"
Turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.
import string

cat = ''.join(data.translate(None, string.punctuation)).split()
print ' '.join(cat).lower()
version 2.x
line = 'hello!'
line.translate(None, '!$%') #'hello'
Load your stopwords/stopchars in a separate function.
Don't hard-code file names/paths.
Your wordList should be a set, not a list.
However, if you are working with chars, not words, investigate str.translate (a sketch follows below).
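A minimal str.translate sketch, assuming Python 3 (the thread's own code is 2.x, where the two-argument translate shown above works instead); str.maketrans builds the deletion table, and leaving '-' out of it avoids the "selfcontained" problem from the question:
import string

data = 'A paragraph (from the Greek paragraphos, "to write beside") is a self-contained unit.'
table = str.maketrans('', '', string.punctuation.replace('-', ''))  # delete punctuation, keep hyphens
print(data.translate(table).lower())
# a paragraph from the greek paragraphos to write beside is a self-contained unit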
One way to go would be to use the replace method and have an exhaustive list of characters you don't want.
For example:
c = ['a', 'h']
a = 'John'
for item in c:
    a = a.replace(item, '')
    print a
prints the following:
John
Jon
